How negligible is the time for ofstream - c++

I have a C++ program where I do various experiments and during these experiments I output some values to a file using ofstream. The structure is basically:
Start timer
output to a file using ofstream (the output is, at most, a few words)
do some experimental work
Stop timer
My question, which is admittedly broad, is: can I ignore the time the ofstream write takes, or is it not negligible? Or does it simply depend?

First of all, from your pseudo code, you could just start the timer after the file output :-) But I'm guessing it's not like that in the real app.
Beyond that, it's obviously a matter of "it depends". If you aren't outputting all that much, and the code you're interested in runs for minutes, then the output obviously won't make much of a difference. If, on the other hand, you are trying to catch runtimes measured in microseconds, you'll probably be mostly measuring the ofstream.
You could try various tricks, like running the actual output on a separate thread, or appending your messages to a preallocated char array and writing it out at the end. However, everything incurs some runtime penalty; nothing is ever free.
Since you're not interested in measuring the actual output time, you could compile a version without the output to do measurements, and a version with the output to debug the code. EDIT: or make that a runtime option. Nothing is ever free, but an "if (OutputEnabled)" is pretty close to "free" :-)
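For what it's worth, the runtime switch can be as simple as the following sketch (the outputEnabled flag and log_line helper are made-up names for illustration, not from the asker's code):

#include <fstream>
#include <string>

bool outputEnabled = true;                       // could be set from a command-line option
std::ofstream logFile("results.txt");

void log_line(const std::string& msg)
{
    if (outputEnabled)                           // a predictable branch; nearly free when disabled
        logFile << msg << '\n';
}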

It mostly depends on what the ofstream does: as long as it just stores the data in its internal buffer it will be fast, but once that buffer fills up and it actually calls the OS API to perform the write, the time spent can be much larger.
But obviously everything also depends on how long the "experimental work" takes in comparison to the I/O you perform, both when the stream merely writes to its internal buffer and when it is flushed; as suggested in the comments, you should time the two things independently to see how one compares to the other.
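If it helps, a minimal sketch of timing the two parts independently with std::chrono could look like this (the file name and the placement of the experimental work are placeholders):

#include <chrono>
#include <fstream>
#include <iostream>

int main()
{
    std::ofstream out("results.txt");

    auto t0 = std::chrono::steady_clock::now();
    out << "a few words\n";                      // the output under suspicion
    auto t1 = std::chrono::steady_clock::now();
    // ... experimental work would go here ...
    auto t2 = std::chrono::steady_clock::now();

    auto io_us   = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    auto work_us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << "output: " << io_us << " us, work: " << work_us << " us\n";
}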

Something is only negligible compared to something else, and your question doesn't say what that something else is.
I previously asked a question here to check the validity of the statement below, and the conclusion was that such a classification shouldn't be relied upon; still, it gives a very rough first estimate (one that may be wrong in some cases):
Stack operations are roughly 10x faster than heap allocations, which are 10x faster than graphical device operations, which are 10x faster than I/O operations (writing to a file on a hard drive, for example), which are 10x faster than network communication operations...
It is only a rough estimate; everything has to be re-evaluated each time you code.
If the time spent writing to the ofstream doesn't impact your program's overall behavior, it can be considered negligible.
If it does impact it, then it cannot be considered negligible. Obviously.

Related

How to do benchmarking for C/C++ code accurately?

I'm asking with regard to the answers on this question. In my answer I initially just took the time before and after the loops and printed the difference, but following the update to cigien's answer, it seems I benchmarked inaccurately by not warming up the code.
What is warming up the code? My guess is that the string was moved into the cache first, which made the benchmarking results for the following loops close to each other; in my old answer the first result was slower than the others because it took extra time to bring the string into the cache. Am I correct? If not, what does warming up actually do, and, more generally, what else should I have done besides warming up to get more accurate results? In other words, how do I benchmark C++ code correctly (and C as well, if the approach is the same)?
To give you an example of warm-up, I recently benchmarked some NVIDIA CUDA kernel calls:
The execution speed seems to increase over time, probably for several reasons, such as the GPU frequency being variable (to save power and keep the chip cool).
Sometimes a slower call even has a knock-on effect on the next call, so the benchmark can be misleading.
If you want to be safe on these points, I advise you to:
reserve all dynamic memory (such as vectors) up front
run a for loop that does the same work several times before taking a measurement (see the sketch after this list)
this implies initializing the input data (especially random data) only once before the loop and copying it each time inside the loop, so that every iteration does the same work
if you deal with complex objects that carry internal caches, I advise packing them in a struct and making an array of that struct (built with the same construction or cloning technique), to ensure that the same work is done on the same starting data in each iteration
you can skip the for loop and the copying IF you alternate the two calls very often and can assume that the behavioral differences cancel each other out, for example in a simulation of continuous data such as positions
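Here is a minimal sketch of the warm-up plus repeated-measurement pattern from the list above, assuming a placeholder work() function and arbitrary iteration counts:

#include <chrono>
#include <cstdio>
#include <vector>

// Placeholder for whatever is actually being measured.
void work(std::vector<double>& data)
{
    for (double& x : data) x = x * 1.000001 + 0.5;
}

int main()
{
    std::vector<double> input(1000000, 1.0);     // initialize the input data once, before timing

    for (int i = 0; i < 10; ++i) {               // warm-up runs: caches, frequency scaling, ...
        std::vector<double> copy = input;        // copy so every run starts from the same data
        work(copy);
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i) {              // measured runs, same work each time
        std::vector<double> copy = input;
        work(copy);
    }
    auto t1 = std::chrono::steady_clock::now();

    long long total = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("avg per run: %lld us\n", total / 100);
}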
Concerning measurement tools, I have always had problems with high_resolution_clock on different machines, such as inconsistent durations. By contrast, the Windows QueryPerformanceCounter has been very reliable for me.
I hope that helps!
EDIT
I forgot to add that, as mentioned in the comments, compiler optimizations can be annoying to deal with. The simplest workaround I have found is to accumulate into a variable that depends on some non-trivial operation over both the warm-up and the measured data, in order to force the computations to actually happen in sequence.
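As a rough illustration of that trick (checksum here is just a made-up accumulator that forces every result to be used):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> data(1000000, 3);
    long long checksum = 0;                      // sink the compiler cannot discard

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i)
        for (int x : data)
            checksum += x * i;                   // non-trivial use of every element
    auto t1 = std::chrono::steady_clock::now();

    // Printing the checksum keeps the whole computation observable.
    std::printf("checksum %lld, time %lld ms\n", checksum,
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
}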

What is needed to get complete, uncorrupted output with OpenMP?

I have some code compiled under the C++11 standard, with multiple threads writing to cout. I noticed that when writing many lines, some lines would occasionally be missing (about 1 in 2,000,000). This surprised me, since my string (outStr below) is local to each thread and I have a critical section around my writes to stdout. The problem went away when I flushed the stream.
#pragma omp critical(cout)
{
cout << outStr;
cout.flush();
}
Is this expected behaviour? What really tricked me is that when I wrote a relatively small number of lines (<100000), I would always see the expected number of lines in the output.
Overall I'm not really happy with the critical section in general since I'm noticing in my profiling that it is causing a lot of contention. I'm open to any suggestions to improve my I/O.
*Edit: I was under the impression that under C++11 there would be no corruption of my output as long as I synchronized it (i.e., no interleaving or missing output when I use a critical section), but the missing lines seem to indicate that this is not guaranteed without also flushing the output.
Much of the problem here stems from the fact that flushing a stream is quite slow. So, when one thread enters the critical section, they stay in it for quite a while. By the time they finish, there are probably several other threads waiting, so it becomes a severe bottleneck.
Probably the most obvious way to prevent the problem is to have the stream itself owned by one thread, and have a thread-safe queue of things that need to be written to the stream. If you're all right with accumulating a "chunk" of data destined for the stream into a string (or some other pre-decided data structure), it's even pretty trivial to do.
I'd note for the record that while you might ultimately want to use a lock-free queue, a fairly simple, old-fashioned queue based on locking will still almost certainly provide a huge improvement over what you're doing right now: inserting a string into a queue is drastically faster than flushing a stream (in the range of nanoseconds to possibly a few microseconds, where flushing is typically on the order of milliseconds).
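A rough sketch of that single-writer idea with a plain mutex-protected queue might look like the following (post and writer are made-up names; main would start the writer thread before the parallel region, then set done under the mutex, notify, and join afterwards):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>

std::queue<std::string> pending;                 // lines waiting to be written
std::mutex mtx;
std::condition_variable cv;
bool done = false;

void post(std::string s)                         // called from the worker threads
{
    {
        std::lock_guard<std::mutex> lock(mtx);
        pending.push(std::move(s));
    }
    cv.notify_one();
}

void writer()                                    // the only thread that touches cout
{
    std::unique_lock<std::mutex> lock(mtx);
    while (!done || !pending.empty()) {
        cv.wait(lock, [] { return done || !pending.empty(); });
        while (!pending.empty()) {
            std::string s = std::move(pending.front());
            pending.pop();
            lock.unlock();
            std::cout << s;                      // the slow part happens outside the lock
            lock.lock();
        }
    }
}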
It seems std::cout.sync_with_stdio(false); was the culprit. I hadn't expected this to be the issue, but I noticed it after writing a simple test program. I had used the call in an attempt to speed up my I/O, which turned out to be a bad idea.
It seems that, even though the writes were inside a critical section, the I/O stream was not thread-safe. I was also not using any printfs in my code.
According to the documentation:
In addition, synchronized C++ streams are guaranteed to be thread-safe (individual characters output from multiple threads may interleave, but no data races occur)
It would seem that, without syncing, the buffer decides to flush at seemingly random intervals for some reason.
After removing this call I was able to get coherent output with a simple `cout << outStr;` statement, without flushing. I am still testing, and it seems I may not even need the critical section at all.

Fortran Print Statements Performance Effects

I just inherited some old Fortran code that has print statements everywhere (when it runs, the matrix streams by). I know these print statements are useless because I cannot tell what the program is printing as it is going by so fast. But is there a significant performance impact to having a lot of print statements in a Fortran program (i.e. does an overly verbose program take longer to execute)? It seems like it would as it is another line to execute, but I don't know if it is significant.
In general, yes, I/O is "relatively costly" to execute since you have to do things like formatting numbers - especially floating point numbers, even if those procedures are highly optimized. However, one of the biggest costs (the system call to actually perform the I/O after the buffer to write has been prepared) is amortized in good compilers/runtimes since the I/O statements are usually buffered by default. This helps cut down the number of system calls significantly, thus reducing delays caused by frequent context switching between your app and the OS.
That said, if you are worried about the performance hit, why don't you try to comment every PRINT or WRITE statement and see how the runtime changes? Or even better, profile your application and see the amount of time spent on I/O and related routines.

Profiling serialization code

I ran my app twice (in the VS IDE). The first time it took 33 seconds. Then I uncommented obj.save, which calls a lot of code, and it took 87 seconds. That's some slow serialization code! I suspect two problems. The first is that I do the following:
template<class T> void Save_IntX(ostream& o, T v){ o.write((char*)&v,sizeof(T)); }
I call this template hundreds of thousands of times (well, maybe not quite that much). Does each .write() take a lock that may be slowing it down? Maybe I could use a memory stream, which doesn't require a lock, and dump that instead? Which ostream could I use that doesn't lock, perhaps on the assumption that it is only used from a single thread?
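One way to test that theory is to serialize into a memory buffer and hand it to the file in a single call at the end. A sketch, adapting the Save_IntX above to append into a std::vector<char> (this is only an illustration, not the asker's actual code):

#include <fstream>
#include <vector>

std::vector<char> buffer;                        // grows in memory; no stream call per field

template<class T> void Save_IntX(std::vector<char>& out, T v)
{
    const char* p = reinterpret_cast<const char*>(&v);
    out.insert(out.end(), p, p + sizeof(T));     // append the raw bytes
}

// After serializing everything, write the whole buffer once:
// std::ofstream f("out.bin", std::ios::binary);
// f.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));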
The other suspected problem is that I use dynamic_cast a lot, but I am unsure whether I can work around that.
Here is a quick profiling session after converting it to use fopen instead of ostream. I wonder why I don't see the majority of my functions in this list, but as you can see, write is still taking the longest. Note: I just realized my output file is half a gig. Oops. Maybe that is why.
I'm glad you got it figured out, but the next time you do profiling, you might want to consider a few points:
The VS profiler in sampling mode does not sample during I/O or any other time your program is blocked, so it's only really useful for CPU-bound profiling. For example, if it says a routine has 80% inclusive time, but the app is actually computing only 10% of the time, that 80% is really only 8%. Because of this, for any non-CPU-bound work, you need to use the profiler's instrumentation mode.
Assuming you did that, of all those columns of data, the one that matters is "Inclusive %", because that is the routine's true cost, in the sense that if it could be avoided, that is how much the overall time would be reduced.
Of all those rows of data, the ones likely to matter are the ones containing your routines, because your routines are the only ones you can do anything about. It looks like "Unknown Frames" are maybe your code, if your code is compiled without debugging info. In general, it's a good idea to profile with debugging info, make it fast, and then remove the debugging info.

C++ Asymptotic Profiling

I have a performance issue where I suspect one standard C library function is taking too long and causing my entire system (a suite of processes) to basically "hiccup". Sure enough, if I comment out the library function call, the hiccup goes away. This prompted me to investigate: what standard methods are there to prove this sort of thing? What would be the best practice for testing a function to see whether it causes an entire system to hang for a second (causing other processes to be momentarily starved)?
I would at least like to definitively correlate the function being called and the visible freeze.
Thanks
The best way to determine this stuff is to use a profiling tool to get the information on how long is spent in each function call.
Failing that, set up a function that reserves a block of memory. Then, at various points in your code, write a string to that memory including the current time. (This avoids the delays associated with writing to the display.)
After you have run your code, pull out the memory and parse it to determine how long the various parts of your code are taking.
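A sketch of that in-memory log, storing a label and a timestamp per event rather than a formatted string (mark, dump, and the reserved capacity are arbitrary choices for illustration):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Mark { const char* label; std::chrono::steady_clock::time_point t; };

std::vector<Mark> marks;

void mark(const char* label)                     // cheap: one push_back, no I/O
{
    marks.push_back({label, std::chrono::steady_clock::now()});
}

void dump()                                      // parse and print only after the run
{
    for (std::size_t i = 1; i < marks.size(); ++i) {
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(
                           marks[i].t - marks[i - 1].t).count();
        std::printf("%s -> %s: %lld us\n", marks[i - 1].label, marks[i].label, us);
    }
}

int main()
{
    marks.reserve(1 << 20);                      // reserve the block up front
    mark("start");
    // ... suspect library call would go here ...
    mark("after suspect call");
    dump();
}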
I'm trying to figure out what you mean by "hiccup". I'm imagining your program does something like this:
while (...){
// 1. do some computing and/or file I/O
// 2. print something to the console or move something on the screen
}
and normally the printed or graphical output hums along in a subjectively continuous way, but sometimes it appears to freeze, while the computing part takes longer.
Is that what you meant?
If so, I suspect that in the normal running state it is almost always in step 2, but in the hiccup state it is spending its time in step 1.
I would comment out step 2, so that it spends nearly all its time in the hiccup state, and then just pause it under the debugger to see what it's doing.
That technique tells you exactly what the problem is with very little effort.