Why would code performance be better in VSCode? - c++

I use g++ 12.2.0 (MinGW-w64 via MSYS2) and experience odd performance behavior when running the same code from VSCode (with the Code Runner extension) and from the command line. In both cases, the command used is g++ f1.cpp -o f1 && f1.
For example purposes, given this code:
#include <iostream>
#include <chrono>
#include <string>   // std::string lives in <string>, not <string.h>

using std::chrono::high_resolution_clock,
      std::chrono::duration_cast,
      std::chrono::nanoseconds;

void timeit(void (*func)(std::string), std::string message){
    auto time_start = high_resolution_clock::now();
    func(message);
    auto time_end = high_resolution_clock::now();
    auto duration = duration_cast<nanoseconds>(time_end - time_start).count();
    std::cout << "Execution Time: " << duration * 1e-6f << " [ms]" << std::endl;
}

void run(std::string message){
    std::cout << message.c_str() << std::endl;
}

int main() {
    for (size_t i = 0; i < 100; i++)
        timeit(&run, std::to_string(i));
}
VSCode Output:
...
96
Execution Time: 0.0025 [ms]
97
Execution Time: 0.0022 [ms]
98
Execution Time: 0.0034 [ms]
99
Execution Time: 0.0029 [ms]
Command Line Output:
...
96
Execution Time: 0.5091 [ms]
97
Execution Time: 0.5168 [ms]
98
Execution Time: 2.4943 [ms]
99
Execution Time: 0.7385 [ms]
The performance is significantly different as seen above.
The slow behavior is also consistent in PowerShell and even in the VSCode integrated terminal itself.
I searched Stack Overflow and other sources to the best of my ability and am still left clueless.
Edit 1: I already tried running it with optimizations and up to 1e6 iterations and the issue still persists.
Edit 2: teapot418's answer was spot on. It really was limited by the performance of the terminal, and replacing std::endl with \n made a big difference in performance. Replacing the print operation with a different (and even more computationally heavy) operation gave equal performance in both environments. Colonel Thirty Two's answer raised an important point too, which I should have clarified beforehand. Issue solved.
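For reference, the change from Edit 2 looks roughly like this (a sketch; only the run function changes):
void run(std::string message){
    std::cout << message.c_str() << '\n';   // '\n' ends the line without forcing a flush, unlike std::endl
}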

You are benchmarking the time of your run function, which is a single std::cout statement. Measuring I/O times of a single output line is going to have a lot of variance and depend on where stdout is getting routed to.
My psychic powers suggest that VSCode is re-routing stdout to its own output window which has a lot more buffering and performance than the default console window. And I'm guessing your console window is running under some sort of bash/unix emulation on Windows. (Or is this Linux?)
Some things to try:
You can try increasing the buffer size of the Windows console (or of the console app you are using to host MinGW).
Redirect to /dev/null or nul: on Windows (e.g. f1 > /dev/null or f1 > nul:). Of course, you won't see the output, but you could change your Execution Time statement to be sent to cerr while cout is still used in the run function (see the sketch after this list). But then again, I'm not sure what you are measuring.
Try compiling with -O2 optimizations. That might nullify the difference.
Try running your compiled program in a default console window instead of the Bash/MinGW/Cygwin thing it might be running in now. You could try running it in PowerShell too.
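A minimal self-contained sketch of the cerr suggestion, using the same structure as the question's code: the payload stays on stdout, the timing goes to stderr, so running f1 > nul: (Windows) or ./f1 > /dev/null discards the payload while the measurements remain visible.
#include <chrono>
#include <iostream>
#include <string>

using std::chrono::high_resolution_clock,
      std::chrono::duration_cast,
      std::chrono::nanoseconds;

void run(std::string message){
    std::cout << message << std::endl;      // payload: still on stdout
}

void timeit(void (*func)(std::string), std::string message){
    auto time_start = high_resolution_clock::now();
    func(message);
    auto time_end = high_resolution_clock::now();
    auto duration = duration_cast<nanoseconds>(time_end - time_start).count();
    std::cerr << "Execution Time: " << duration * 1e-6 << " [ms]\n";   // timing: stderr
}

int main() {
    for (size_t i = 0; i < 100; i++)
        timeit(&run, std::to_string(i));
}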

Related

FIO runtime different than gettimeofday()

I am trying to measure the execution time of the FIO benchmark. Currently, I am doing so by wrapping the FIO call between gettimeofday() calls:
gettimeofday(&startFioFix, NULL);
FILE* process = popen("fio --name=randwrite --ioengine=posixaio --rw=randwrite --size=100M --direct=1 --thread=1 --bs=4K", "r");
gettimeofday(&doneFioFix, NULL);
and calculate the elapsed time as:
double tstart = startFioFix.tv_sec + startFioFix.tv_usec / 1000000.;
double tend = doneFioFix.tv_sec + doneFioFix.tv_usec / 1000000.;
double telapsed = (tend - tstart);
Now, the question(s):
1. The telapsed time is different (larger) than the runt reported by FIO. Can you please help me understand why? The runt can be seen in the FIO output:
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=posixaio, iodepth=1
fio-2.2.8
Starting 1 thread
randwrite: (groupid=0, jobs=1): err= 0: pid=3862: Tue Nov 1 18:07:50 2016
write: io=102400KB, bw=91674KB/s, iops=22918, runt= 1117msec
...
and the telapsed is:
telapsed: 1.76088 seconds
2. What is the actual time taken by the FIO execution:
a) the runt given by FIO, or
b) the elapsed time measured by gettimeofday()?
3. How does FIO measure its runt? (This question is probably linked to 1.)
PS: I have tried replacing gettimeofday() with std::chrono::high_resolution_clock::now(), but it behaves the same (by the same, I mean it also gives a larger elapsed time than runt).
Thank you in advance for your time and assistance.
A quick point: gettimeofday() on Linux uses a clock that doesn't necessarily tick at a constant interval and can even move backwards (see http://man7.org/linux/man-pages/man2/gettimeofday.2.html and https://stackoverflow.com/a/3527632/4513656 ) - this may make telapsed unreliable (or even negative).
Your gettimeofday/popen/gettimeofday measurement (telapsed) is going to be: the fio process start-up elapsed (i.e. fork+exec on Linux) + fio initialisation (e.g. thread creation because I see --thread, ioengine initialisation) + fio job elapsed (runt) + fio stopping elapsed + process stop elapsed. You are comparing this to just runt, which is a sub-component of telapsed. It is unlikely all the non-runt components are going to happen instantly (i.e. take up 0 usecs), so the expectation is that runt will be smaller than telapsed. Try running fio with --debug=all just to see all the things it does in addition to actually submitting I/O for the job.
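For what it's worth, a minimal sketch that times the whole popen/pclose lifetime with a monotonic clock instead of gettimeofday() (assuming the same fio command line as above):
#include <chrono>
#include <cstdio>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;   // monotonic, unlike gettimeofday()

    auto t0 = clock::now();
    FILE* process = popen("fio --name=randwrite --ioengine=posixaio --rw=randwrite "
                          "--size=100M --direct=1 --thread=1 --bs=4K", "r");
    if (process) {
        char buf[4096];
        while (fgets(buf, sizeof buf, process) != nullptr)
            ;                    // drain fio's output so it never blocks on a full pipe
        pclose(process);         // waits for the fio process to exit
    }
    auto t1 = clock::now();

    std::cout << "telapsed: " << std::chrono::duration<double>(t1 - t0).count()
              << " seconds\n";
}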
This is difficult to answer because it depends on what you mean when you say "fio execution" and why (i.e. the question is hard to interpret in an unambiguous way). Are you interested in how long fio actually spent trying to submit I/O for a given job (runt)? Are you interested in how long it takes your system to start/stop a new process that just so happens to try and submit I/O for a given period (telapsed)? Are you interested in how much CPU time was spent submitting I/O (none of the above)? So because I'm confused I'll ask you some questions instead: what are you going to use the result for and why?
Why not look at the source code? https://github.com/axboe/fio/blob/7a3b2fc3434985fa519db55e8f81734c24af274d/stat.c#L405 shows runt comes from ts->runtime[ddir]. You can see it is initialised by a call to set_epoch_time() (https://github.com/axboe/fio/blob/6be06c46544c19e513ff80e7b841b1de688ffc66/backend.c#L1664 ), and is updated by update_runtime() (https://github.com/axboe/fio/blob/6be06c46544c19e513ff80e7b841b1de688ffc66/backend.c#L371 ), which is called from thread_main().

Single thread programme apparently using multiple core

Question summary: all four cores used when running a single threaded programme. Why?
Details: I have written a non-parallelised programme in Xcode (C++). I was in the process of parallelising it, and wanted to see whether what I was doing was actually resulting in more cores being used. To that end I used Instruments to look at the core usage. To my surprise, while my application is single threaded, all four cores were being utilised.
To test whether it changed the performance, I dialled down the number of cores available to 1 (you can do it in Instruments, preferences) and the speed wasn't reduced at all. So (as I knew) the programme isn't parallelised in any way.
I can't find any information on what it means to use multiple cores to perform single threaded tasks. Am I reading the Instruments output wrong? Or is the single-threaded process being shunted between different cores for some reason (like changing lanes on a road instead of driving in two lanes at once - i.e. actual parallelisation)?
Thanks for any insight anyone can give on this.
EDIT with MWE (apologies for not doing this initially).
The following is C++ code that finds primes under 500,000, compiled in Xcode.
#include <iostream>
#include <ctime>    // for clock(), clock_t, CLOCKS_PER_SEC

int main(int argc, const char * argv[]) {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;
    int num_max = 500000;
    while (num <= num_max) {
        i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num) {
            primes++;
            std::cout << "Prime: " << num << std::endl;
        }
        num++;
    }
    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    std::cout << "This machine calculated all " << primes << " under " << num_max << " in " << runTime << " seconds." << std::endl;
    return 0;
}
This runs in 36s or thereabouts on my machine, as shown by the final output and my phone's stopwatch. When I profile it (using Instruments launched from within Xcode) it gives a run-time of around 28s. The following image shows the core usage.
[Instruments screenshot: core usage across all 4 cores (with hyper-threading)]
Now I reduce the number of available cores to 1. Re-running from within the profiler (pressing the record button), it reports a run-time of 29s; a picture is shown below.
[Instruments screenshot: core usage with only 1 core available]
That would accord with my theory that more cores don't improve performance for a single-threaded programme! Unfortunately, when I actually time the programme with my phone, the above took about 1 minute 30s, so there is a meaningful performance gain from having all cores switched on.
One thing that is really puzzling me is that if you leave the number of cores at 1, go back to Xcode and run the program, it again says it takes about 33s, but my phone says it takes 1 minute 50s. So changing the number of cores is (perhaps) doing something to the internal clock.
Hopefully that describes the problem fully. I'm running on a 2015 15-inch MBP with a 2.2GHz quad-core i7 processor, Xcode 7.3.1.
Let me preface this by saying that your question lacks a lot of the information needed for an accurate diagnosis. Anyway, I'll try to explain the most common reason, IMHO, assuming your application doesn't use any third-party component that works in a multi-threaded way.
I think this could be an effect of the scheduler. I'm going to explain what I mean.
Each core of the processor takes a process in the system and executes it for a "short" amount of time. This is the most common scheduling approach in desktop operating systems.
Your process is executed on a single core for this amount of time and then stopped in order to allow other processes to continue. When your process is resumed it could be executed on another core (always one core at a time, but a different one). So a task manager with low time resolution could register utilization on all cores, even though the process only ever runs on one of them at any instant.
In order to verify whether that is the cause, I suggest you look at the CPU % used while your application is running. For a single-threaded application, the CPU usage should be about 1/#cores, in your case 25%.
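One way to sketch that check in code: compare the process CPU time from clock() with wall-clock time from std::chrono::steady_clock around a CPU-bound loop; a ratio near 1.0 means a single core is doing the work.
#include <chrono>
#include <ctime>
#include <iostream>

int main() {
    // Compare process CPU time (clock()) with wall-clock time (steady_clock)
    // around a CPU-bound loop: the ratio approximates how many cores are busy.
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    volatile unsigned long long sink = 0;
    for (unsigned long long i = 0; i < 300000000ULL; ++i)
        sink = sink + i;                     // stand-in for the real workload

    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    double cpu_s  = double(c1 - c0) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(w1 - w0).count();
    std::cout << "CPU/wall ratio: " << cpu_s / wall_s
              << " (about 1.0 means a single core is doing the work)\n";
}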
If it's a release build, your compiler may be vectorising or auto-parallelising your code. Also, libraries you link against, say the standard library for example, may be threaded or vectorised.

C++ Program, Console/Terminal Output. How to implement "updating text"

I am writing a C++ program, which runs a long data analysis algorithm. It takes several days to finish running, so it is useful to have a prompt which outputs the "percentage complete" every time a new loop in the program starts, so that the user (me) knows the computer isn't sitting in an infinite loop somewhere and hasn't crashed.
At the moment I am doing this the most basic way, by computing the percentage complete as a floating point number and doing:
std::cout << "Percentage complete: " << percentage_complete << " %" << std::endl;
But, when the program has a million loops to run, this is kind of messy. In addition, if the terminal scrollback is only 1000 lines, then I lose the initial debug info printed out at the start once the program is 0.1 % complete.
I would like to copy an idea I have seen in other programs, where instead of writing a new line each time with the percentage complete, I simply replace the last line written to the terminal with the new percentage complete.
How can I do this? Is that possible? And if so, can this be done in a cross platform way? Are there several methods of doing this?
I am unsure how to describe what I am trying to do perfectly clearly, so I hope this is clear enough that you understand what I am trying to do.
To clarify, rather than seeing this:
Running program.
Debug info:
Total number of loops: 1000000
Percentage complete: 0 %
Percentage complete: 0.001 %
Percentage complete: 0.002 %
.
.
.
Percentage complete: 1.835 %
I would like to see this:
Running program.
Debug info:
Total number of loops: 1000000
Percentage complete: 1.835 %
And then on the next loop the terminal should update to this:
Running program.
Debug info:
Total number of loops: 1000000
Percentage complete: 1.836 %
I hope that's enough information.
(Okay, so this output would actually be for 100000 steps, not 1000000.)
Instead of \n or std::endl, use \r. The difference is that \r returns the cursor to the beginning of the line without starting a new line.
Disclaimer (as per Lightness' objections): This is not necessarily portable, so YMMV.
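A minimal sketch of that approach (the trailing spaces pad over shorter values from the previous update):
#include <iostream>

int main() {
    const long total = 1000000;
    for (long i = 0; i <= total; ++i) {
        // ... one iteration of the real work goes here ...

        if (i % 1000 == 0) {
            // '\r' returns the cursor to the start of the line; std::flush pushes the
            // text out immediately without appending a newline, so the next update
            // overwrites this one.
            std::cout << "Percentage complete: " << (100.0 * i / total) << " %  \r"
                      << std::flush;
        }
    }
    std::cout << std::endl;   // finish with a real newline so the prompt doesn't overwrite it
}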

Why does my program run faster when I redirect stdout?

I'm seeing something really strange. I've written a tiny code timer to capture how long blocks of code are running for. I can't post all the code, it's pretty big, but I've been through the block in question and there is nothing in it going anywhere near std::cout.
$ bin/profiler 50
50 repetitions of Generate Config MLP took: 254 microseconds
50 repetitions of Create Population took: 5318 microseconds
50 repetitions of Create and Score Population took: 218047 microseconds
$ bin/profiler 50 > time_times
$ cat time_times
50 repetitions of Generate Config MLP took: 258 microseconds
50 repetitions of Create Population took: 5438 microseconds
50 repetitions of Create and Score Population took: 168379 microseconds
$ bin/profiler 50
50 repetitions of Generate Config MLP took: 269 microseconds
50 repetitions of Create Population took: 5447 microseconds
50 repetitions of Create and Score Population took: 216262 microseconds
$ bin/profiler 50 > time_times
$ cat time_times
50 repetitions of Generate Config MLP took: 260 microseconds
50 repetitions of Create Population took: 5321 microseconds
50 repetitions of Create and Score Population took: 169431 microseconds
Here is the block I'm using to time; the function pointer is just a link to a void function which makes a single function call. I'm aware there are probably better ways to time something, I wanted quick and dirty so I could start to improve on the code.
void timeAndDisplay(string name, function_ptr f_ptr) {
    struct timeval start, end;
    long mtime, seconds, useconds;

    gettimeofday(&start, NULL);
    // Run the code
    for (unsigned x = 0; x < reps; x++) {
        f_ptr();
    }
    gettimeofday(&end, NULL);

    seconds = end.tv_sec - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = ((seconds) * 1000000 + useconds/1.0) + 0.0005;

    std::cout << reps << " repetitions of " << name << " took: " << mtime << " microseconds" << std::endl;
}
I'm compiling and linking with:
g++ -c -Wall -O3 -fopenmp -mfpmath=sse -march=native src/profiler.cpp -o build/profiler.o
g++ build/*.o -lprotobuf -lgomp -lboost_system -lboost_filesystem -o bin/profiler
I was about to start making changes, so I thought I would save a baseline, but the Create and Score Population step performs differently when I redirect the output!
Does anybody know what is going on?
Update 1:
First pass with profiling doesn't show anything significant. Almost all the top calls are related to the vector math the program is running (Eigen library). The prevailing theory is that there is some blocking I/O for the console, but the calls to std::cout are outside of the function loop and only 3 in total, so I find it hard to accept that they have such an impact.
Update 2:
After having this drive me crazy for some time, I gave up a bit and started to make improvements to my program with the data I had available. It got weirder, but I think I've found one of the main influencing factors: the available system entropy. My program uses huge amounts of random numbers, and it seems to run at a slower pace after being run a number of times with either method. I used a for loop to simulate both methods, and although it is quicker with stdout redirected, I suspect this tiny piece of I/O bumps urandom a little, which is why it's faster. I'm still investigating, but if anyone can point me in the right direction to prove this I'd be very grateful.
Writing to or from the standard console input/output system involves buffers and locks, so I would say that you're generally taking a performance hit because of locking and buffering.
I would recommend using a profiler to find out what's taking the longest amount of time.
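As a rough way to see how much the destination and flushing of stdout matter, a small sketch that writes the same number of lines with a per-line flush (std::endl) and without ('\n'); timing it once in a console and once redirected to a file shows the difference. The timings go to stderr so they survive the redirect.
#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;
    const int lines = 100000;

    auto t0 = clock::now();
    for (int i = 0; i < lines; ++i)
        std::cout << i << std::endl;   // flushes the buffer on every line
    auto t1 = clock::now();
    for (int i = 0; i < lines; ++i)
        std::cout << i << '\n';        // lets the stream decide when to flush
    auto t2 = clock::now();

    std::cerr << "endl: " << std::chrono::duration<double>(t1 - t0).count() << " s, "
              << "\\n: "  << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}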
Writing to the console involves graphics manipulations and maybe also handling of carriage returns (moving to the beginning of the line) and linefeed (moving all the previous text up one line and erasing the top line).
Redirecting to a file is usually faster because the output is appended and no graphics manipulations take place.
At least that has been my experience.
Have you tried running it with the window off the screen or behind another window so it does not have to be drawn? I have been on some systems where that would sort of bypass the redrawing of the window.

n900 - maemo - timing

I am attempting to save a file every second within +- 100ms (10% error). The problem I am having is that my timing measurement is saying that execution took 1150 ms, but in reality it appears to be 3 or 4 seconds.
What's going on?
If I issue the command, sleep(1), it appears to be very accurate. However, when I measure how long something took, it must be off by quite a bit.
I am using clock() to measure program execution. All of this stuff is within a while loop.
Walter
Your problem is that clock() reports the CPU time used by your process, which is usually different from the "real" (wall-clock) time elapsed.
For example, the following code:
#include <time.h>
#include <iostream>
#include <unistd.h>

using namespace std;

int main()
{
    clock_t scl = clock();
    sleep(1);
    cout << "CPU clock time " << clock()-scl << endl;
}
gives
time ./a.out
CPU clock time 0
real 0m1.005s
user 0m0.000s
sys 0m0.004s
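If wall-clock time is what the once-per-second check needs, a minimal sketch using clock_gettime(CLOCK_MONOTONIC) instead of clock() (on older glibc this may need linking with -lrt):
#include <time.h>
#include <unistd.h>
#include <iostream>

int main()
{
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   // wall-clock style, not CPU time
    sleep(1);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_ms = (t1.tv_sec - t0.tv_sec) * 1000.0
                      + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    std::cout << "Elapsed: " << elapsed_ms << " ms" << std::endl;   // ~1000 ms
}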