When does cout flush? - c++

I know endl or calling flush() will flush it. I also know that when you call cin after cout, it flushes too. And also when the program exit. Are there other situations that cout flushes?
I just wrote a simple loop, and I didn't flush it, but I can see it being printed to the screen. Why? Thanks!
for (int i =0; i<399999; i++) {
cout<<i<<"\n";
}
Also the time for it to finish is same as withendl both about 7 seconds.
for (int i =0; i<399999; i++) {
cout<<i<<endl;
}

There is no strict rule by the standard - only that endl WILL flush, but the implementation may flush at any time it "likes".
And of course, the sum of all digits in under 400K is 6 * 400K = 2.4MB, and that's very unlikely to fit in the buffer, and the loop is fast enough to run that you won't notice if it takes a while between each output. Try something like this:
for(int i = 0; i < 100; i++)
{
cout<<i<<"\n";
Sleep(1000);
}
(If you are using a Unix based OS, use sleep(1) instead - or add a loop that takes some time, etc)
Edit: It should be noted that this is not guaranteed to show any difference. I know that on my Linux machine, if you don't have a flush in this particular type of scenario, it doesn't output anything - however, some systems may do "flush on \n" or something similar.

Related

How can I print a newline without flushing the buffer?

I was experimenting with c++ trying to figure out how I could print the numbers from 0 to n as fast as possible.
At first I just printed all the numbers with a loop:
for (int i = 0; i < n; i++)
{
std::cout << i << std::endl;
}
However, I think this flushes the buffer after every single number that it outputs, and surely that must take some time, so I tried to first print all the numbers to the buffer (or actually until it's full as it seems then seems to flush automatically) and then flush it all at once. However it seems that printing a \n after flushes the buffer like the std::endl so I omitted it:
for (int i = 0; i < n; i++)
{
std::cout << i << ' ';
}
std::cout << std::endl;
This seems to run about 10 times faster than the first example. However I want to know how to store all the values in the buffer and flush it all at once rather than letting it flush every time it becomes full so I have a few questions:
Is it possible to print a newline without flushing the buffer?
How can I change the buffer size so that I could store all the values inside it and flush it at the very end?
Is this method of outputting text dumb? If so, why, and what would be a better alternative to it?
EDIT: It seems that my results were biased by a laggy system (Terminal app of a smartphone)... With a faster system the execution times show no significant difference.
TL;DR: In general, using '\n' instead of std::endl is faster since std::endl
Explanation:
std::endl causes a flushing of the buffer, whereas '\n' does not.
However, you might or might not notice any speedup whatsoever depending upon the method of testing that you apply.
Consider the following test files:
endl.cpp:
#include <iostream>
int main() {
for ( int i = 0 ; i < 1000000 ; i++ ) {
std::cout << i << std::endl;
}
}
slashn.cpp:
#include <iostream>
int main() {
for ( int i = 0 ; i < 1000000 ; i++ ) {
std::cout << i << '\n';
}
}
Both of these are compiled using g++ on my linux system and undergo the following tests:
1. time ./a.out
For endl.cpp, it takes 19.415s.
For slashn.cpp, it takes 19.312s.
2. time ./a.out >/dev/null
For endl.cpp, it takes 0.397s
For slashn.cpp, it takes 0.153s
3. time ./a.out >temp
For endl.cpp, it takes 2.255s
For slashn.cpp, it takes 0.165s
Conclusion: '\n' is definitely faster (even practically), but the difference in speed can be dependant upon other factors. In the case of a terminal window, the limiting factor seems to depend upon how fast the terminal itself can display the text. As the text is shown on screen, and auto scrolling etc needs to happen, massive slowdowns occur in the execution. On the other hand, for normal files (like the temp example above), the rate at which the buffer is being flushed affects it a lot. In the case of some special files (like /dev/null above), since the data is just sinked into a black-hole, the flushing doesn't seem to have an effect.

How to set a priority for the execution of the program in source code?

I wrote the following code, that must do search of all possible combinations of two digits in a string whose length is specified:
#include <iostream>
#include <Windows.h>
int main ()
{
using namespace std;
cout<<"Enter length of array"<<endl;
int size;
cin>>size;
int * ps=new int [size];
for (int i=0; i<size; i++)
ps[i]=3;
int k=4;
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
while (k>=0)
{
for (int bi=0; bi<size; bi++)
std::cout<<ps[bi];
std::cout<<std::endl;
int i=size-1;
if (ps[i]==3)
{
ps[i]=4;
continue;
}
if (ps[i]==4)
{
while (ps[i]==4)
{
ps[i]=3;
--i;
}
ps[i]=4;
if (i<k)
k--;
}
}
}
When programm was executing on Windows 7, I saw that load of CPU is only 10-15%, in order to make my code worked faster, i decided to change priority of my programm to High. But when i did it there was no increase in work and load of CPU stayed the same. Why CPU load doesn't change? Incorrect statement SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);? Or this code cannot work faster?
If your CPU is not working at it's full capacity it means that your application is not capable of using it because of causes like I/O, sleeps, memory or other device throughtput capabilties.
Most probably, however, it means that your CPU has 2+ cores and your application is single-threaded. In this case you have to go through the process of paralellizing your application, which is often neither simple nor fast.
In case of the code you posted, the most time consuming operation is actually (most probably) printing the results. Remove the cout code and see for yourself how fast the code will work.
Increasing the priority of your programm won't help much.
What you need to do is to remove the cout from your calculations. Store your computations and output them afterwards.
As others have noted it might also be that you use a multi-core machine. Anyway removing any output from your computation loop is always a first step to use 100% of your machines computation power for that and not waste cycles on output.
std::vector<int> results;
results.reserve(1000); // this should ideally match the number of results you expect
while (k>=0)
{
for (int bi=0; bi<size; bi++){
results.push_back(ps[bi]);
}
int i=size-1;
if (ps[i]==3)
{
ps[i]=4;
continue;
}
if (ps[i]==4)
{
while (ps[i]==4)
{
ps[i]=3;
--i;
}
ps[i]=4;
if (i<k)
k--;
}
}
// now here yuo can output your data
for(auto&& res : results){
cout << res << "\n"; // \n to not force flush
}
cout << endl; // now force flush
What's probably happening is you're on a multi-core/multi-thread machine and you're running on only one thread, the rest of the CPU power is just sitting idle. So you'll want to multi-thread your code. Look at boost thread.

for loop optimization c ++

this is my first time posting in this site and I hope I get some help/hint. I have an assignment where I need to optimize the performance to the inner for loop but I have no idea how to do that. the code was given in the assignment. I need to count the time(which I was able to do) and improve the performance.
Here is the code:
//header files
#define N_TIMES 200 //This is originally 200000 but changed it to test the program faster
#define ARRAY_SIZE 9973
int main (void) {
int *array = (int*)calloc(ARRAY_SIZE, sizeof(int));
int sum = 0;
int checksum = 0;
int i;
int j;
int x;
// Initialize the array with random values 0 to 13.
srand(time(NULL));
for (j=0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
//printf("Checksum is %d.\n",checksum);
for (i = 0; i < N_TIMES; i++) {
// Do not alter anything above this line.
// Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
printf("Sum is now: %d\n",sum);
}
// Do not alter anything below this line.
// ---------------------------------------------------------------
// Check each iteration.
//
if (sum != checksum) {
printf("Checksum error!\n");
}
sum = 0;
}
return 0;
}
The code takes about 695 seconds to run. Any help on how to optimize it please?
thanks a lot.
The bottleneck in that loop is obviously the IO done by printf; since you are probably writing the output on a console, the output is line buffered, which means that the stdio buffer is flushed at each iteration, which slows down things a lot.
If you have to do all that prints, you can greatly enhance the performance by forcing the stream to do block buffering: before the for add a
setvbuf(stdout, NULL, _IOFBF, 0);
In alternative, if this approach is not considered valid, you can do your own buffering by allocating a big buffer on your own and do your own buffering: write in your buffer using sprintf, periodically emptying it in the output stream with a fwrite.
Also, you can use the poor man's approach to buffering - just use a buffer big enough to write all that stuff (you can calculate how big it must be quite easily) and write in it without worrying about when it's full, when to empty it, ... - just empty it at the end of the loop. edit: see #paxdiablo's answer for an example of this
Applying just the first optimization, what I get with time is
real 0m6.580s
user 0m0.236s
sys 0m2.400s
vs the original
real 0m8.451s
user 0m0.700s
sys 0m3.156s
So, we got down of ~3 seconds in real time, half a second in user time and ~0.7 seconds in system time. But what we can see here is the huge difference between user+sys and real, which means that the time is not spent in doing something inside the process, but waiting.
Thus, the real bottleneck here is not in our process, but in the process of the virtual terminal emulator: sending huge quantities of text to the console is going to be slow no matter what optimizations we do in our program; in other words, your task is not CPU-bound, but IO-bound, so CPU-targeted optimizations won't be of much benefit, since at the end you have to wait anyway for your IO device to do his slow stuff.
The real way to speed up such a program would be much simpler: avoid the slow IO device (the console) and just write the data to file (which, by the way, is block-buffered by default).
matteo#teokubuntu:~/cpp/test$ time ./a.out > test
real 0m0.369s
user 0m0.240s
sys 0m0.068s
Since there's absolutely no variation in that loop based on i (the outer loop), you don't need to calculate it each time.
In addition, the printing of the data should be outside the inner loop so as not to impose I/O costs on the calculation.
With those two things in mind, one possibility is:
static int sumCalculated = 0;
if (!sumCalculated) {
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
sumCalculated = 1;
}
printf("Sum is now: %d\n",sum);
although that has different output to the original which may be an issue (one line at the end rather than one line per addition).
If you do need to print the accumulating sum within the loop, I'd simply buffer that as well (since it doesn't vary each time through the i loop.
The string Sum is now: 999999999999\n (12 digits, it may vary depending on your int size) takes up 25 bytes (excluding terminating NUL). Multiply that by 9973 and you need a buffer of about 250K (including a terminating NUL). So something like this:
static char buff[250000];
static int sumCalculated = 0;
if (!sumCalculated) {
int offset = 0;
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
offset += sprintf (buff[offset], "Sum is now: %d\n",sum);
}
sumCalculated = 1;
}
printf ("%s", buff);
Now that sort of defeats the whole intent of the outer loop as a benchmark tool but loop-invariant removal is a valid approach to optimisation.
Move the printf outside the for loop.
// Do not alter anything above this line.
//Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
printf("Sum is now: %d\n",sum);
// Do not alter anything below this line.
// ---------------------------------------------------------------
Getting the I/O out of the loop is a big help.
Depending on the compiler and machine, you might get a tiny increase in speed by using pointers rather than indexing (though on modern hardware, it generally doesn't make a difference).
Loop unrolling might help to increase the ratio of useful work to loop overhead.
You could use vector instructions (e.g., SIMD) to do a bunch of calculation in parallel.
Are you allowed to pack the array? Can you use an array of a smaller type than int (given that all the values are very small)? Making the array physically shorter improves locality.
Loop unrolling might look something like this:
for (int j = 0; j < ARRAY_SIZE; j += 2) {
sum += array[j] + array[j+1];
}
You'd have to figure out what to do if the array isn't an exact multiple of the unrolling size (which is probably why the assignment uses a prime number).
You would have to experiment to see how much unrolling would be the right amount.

CPU Utilization High

We are using the following method to write the log to log file. The log entries are kept in a vector named m_LogList (stl string entries are kept in the vector). The method is called when the size of the vector is more than 100. The CPU utilization of the Log server is around 20-40% if we call FlushLog method. If we comment out the FlushLog method, the CPU utilisation drops to 10-20% range.
What optimizations can I use to reduce the CPU utilization? We are using fstream object for writing the log entries to file
void CLogFileWriter::FlushLog()
{
CRCCriticalSectionLock lock(m_pFileCriticalSection);
//Entire content of the vector are writing to the file
if(0 < m_LogList.size())
{
for (int i = 0; i < (int)m_LogList.size(); ++i)
{
m_ofstreamLogFile << m_LogList[i].c_str()<<endl;
m_nSize = m_ofstreamLogFile.tellp();
if(m_pLogMngr->NeedsToBackupFile(m_nSize))
{
// Backup the log file
}
}
m_ofstreamLogFile.flush();
m_LogList.clear(); //Clearing the content of the Log List
}
}
The first optimization I'd use is to drop the .c_str() in << m_LogList[i].c_str(). It forces operator<< to do a strlen (O(n)) instead of relying on string::size (O(1)).
Also, I'd just sum string sizes, instead of calling tellp.
Finally, << endl includes a flush, on every line. Just use << '\n'. You already have the flush at the end.
I'd consider first of all dumping the log in one stdlib call like this:
std::copy(list.begin(), list.end(), std::ostream_iterator<std::string>(m_ofstreamLogFile, "\n"));
This will remove the flushing because of endl and the unnecessary conversion to a c string. CPU-wise this should be quite efficient.
You can do the backup afterwards unless you really care about a very specific limit, but even in that case I'd say: backup atsome lower threshold so that you can account for some overflow.
Also, remove if(0 < m_LogList.size()), it's not really necessary for anything.
A few comments:
if(0 < m_LogList.size())
Should be:
if(!m_LogList.empty())
Although with a vector it shouldn't make a difference.
Also you should consider moving the
m_nSize = m_ofstreamLogFile.tellp();
if(m_pLogMngr->NeedsToBackupFile(m_nSize)) { /*...*/ }
out of the loop. You don't say how much CPU it uses, but I'd bet it is heavy.
You could also iterate using iterators:
for (int i = 0; i < (int)m_LogList.size(); ++i)
Should be:
for (std::vector<std::string>::iterator it = m_LogList.begin();
it != m_LogList.end(); ++it)
Lastly, change the line:
m_ofstreamLogFile << m_LogList[i].c_str()<<endl;
Into:
m_ofstreamLogFile << m_LogList[i] << '\n';
The .c_str() is unnecesary. And the endl writes an EOL and flushes the stream. You do not want to do that, as you are flushing it at the end of the loop.

OpenMP and File I/O

I'm doing some time trials on my code, and logically it seems really easy to parallelize with OpenMP as each trial is independent of the others. As it stands, my code looks something like this:
for(int size = 30; size < 50; ++size) {
#pragma omp parallel for
for(int trial = 0; trial < 8; ++trial) {
time_t start, end;
//initializations
time(&start);
//perform computation
time(&end);
output << size << "\t" << difftime(end,start) << endl;
}
output << endl;
}
I have a sneaking suspicion that this is kind of a faux pas, however, as two threads may simultaneously write values to the output, thus screwing up the formatting. Is this a problem, and if so, will surrounding the output << size << ... code with a #pragma omp critical statement fix it?
Never mind whether your output will be screwed up (it likely will). Unless you're really careful to assign your OpenMP threads to different processors that don't share resources like memory bandwidth, your time trials aren't very meaningful either. Different runs will be interfering with each other.
The solution to the problem you're asking about is to write the result times into designated elements of an array, with one slot for each trial, and ouput the results after the fact.
As long as you don't mind the individual lines being out of order you'll be fine. OpenMP should make sure a whole line is printed at a time.
However, you will need to declare start and end as private in the pragma otherwise the threads will overwrite them and mess up your timings.