Why would non-parallelized Fortran measure too long a CPU time?

My program reports a CPU time longer than the wall-clock time it was actually running, with no parallelization written in the code.
The code is written mostly in Fortran 90 (there are one or two later-Fortran features I added in) and compiled with my Linux machine's native gfortran compiler (--version information: GNU Fortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)). I understand that gfortran compiles to later standards than Fortran 90.
When the program starts it calls call cpu_time(time_start), and before it ends it calls call cpu_time(time_end). The difference time_end - time_start gives the elapsed CPU time in seconds.
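For clarity, the distinction between the two clocks involved can be illustrated in C++ (the language of the other snippets on this page); this is a minimal sketch, not my actual program. std::clock() measures process CPU time, as cpu_time does, while std::chrono::steady_clock measures wall-clock time. For serial, CPU-bound code the two results should agree closely:

#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
    std::clock_t cpu_start = std::clock();               // process CPU time
    auto wall_start = std::chrono::steady_clock::now();  // wall-clock time

    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i) x += 1.0;      // some serial work

    double cpu_s = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - wall_start).count();
    std::printf("CPU: %f s, wall: %f s\n", cpu_s, wall_s);
}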
So here's the curious thing: I used HTCondor to submit my code to run on whatever machine in my local network had an available CPU. My HTCondor log files show the job was submitted 07/24 14:17:46, started running 15 seconds later, then ran to completion on that same machine, ending 07/30 11:01:52, a wall-clock time of less than six days. However, time_end - time_start says that the CPU time was 993535 seconds, or over 11 days. My code is not parallelized at all, so how can this be?
I've run this code hundreds of times before and never noticed this phenomenon, though I've never checked closely either.
Edit: I wish to note once again that my code is not parallelized, at least not explicitly. I do compile with the -O3 flag, but I don't think that introduces parallelization. If the linked question/answer about parallel Fortran does indeed answer my question about a serial process, please help me understand how, because I do not see the connection.
My HTCondor submission script is as follows. I condor_submit this script and that's how I run the code.
executable = /path/to/executable
universe = standard
log = condorlog.log
output = condorstdout.out
error = condorerror.out
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue

Related

Will std::fopen execution time depend on file system content in Linux?

I am running profiling of my C++ code on a purpose-built machine which is running Linux. The machine is re-installed sometimes, something which is outside of my control, but I'm thinking that the file system of the machine gradually fills up during the two weeks or so between the cleanups, and that this impacts my profiling measurements.
I have noticed that my profiling results get worse over time and then return to normal when the machine is cleaned up. This led to further investigation, and I can see that std::fopen takes ~10 times longer to execute the day before cleanup compared to the day after.
Is it expected that std::fopen execution time depends on what is stored in the file system? The file I'm opening is located in a directory which is always empty when I start my test case. Could there be some search involved when calling std::fopen regardless, and why would the execution time vary so much?
std::fopen after cleanup : 0.000098141 seconds
std::fopen before cleanup: 0.000940125 seconds
I'm running a reasonably recent gcc. The machine has an ARM architecture.
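For reference, here is a minimal sketch of how such a measurement might be taken (the path is hypothetical), timing a single call with std::chrono::steady_clock:

#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::FILE *f = std::fopen("/tmp/testdir/testfile", "r");  // hypothetical path
    auto t1 = std::chrono::steady_clock::now();

    std::printf("std::fopen took %.9f seconds\n",
                std::chrono::duration<double>(t1 - t0).count());
    if (f) std::fclose(f);
}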

Timing in OpenMPI

Quick question about timing in OpenMPI. I see that with qstat -a I can show the wall time, and with qstat I can see the CPU time. Is it possible to have these two values written to the output file when the job is done so I can check the performance of my code?
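One approach (a sketch, not from the original thread; it assumes the standard MPI C API) is to record both values inside the program and print them, so they land in the job's output file: MPI_Wtime gives wall time and std::clock gives the process's CPU time.

#include <mpi.h>
#include <cstdio>
#include <ctime>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double wall_start = MPI_Wtime();        // wall-clock seconds
    std::clock_t cpu_start = std::clock();  // this process's CPU time

    // ... the real computation goes here ...

    double wall_s = MPI_Wtime() - wall_start;
    double cpu_s = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    if (rank == 0)
        std::printf("wall: %f s, CPU (rank 0): %f s\n", wall_s, cpu_s);

    MPI_Finalize();
}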

Why does a project "run" faster in NetBeans internal terminal than in windows command prompt?

I am new to this forum, so excuse me if I am not asking the question right the first time.
I don't think it is necessary to provide code here, since the question is probably general and has nothing to do with the code itself.
I have written and built a C++ project in NetBeans 7.1.2 using MinGW (g++) on Windows XP. Both the Debug and the Release versions work fine and deliver the desired output of the computations. However, if I "run" the project (either version) in NetBeans' internal terminal, I measure computation times of between 115 and 130 microseconds. If I execute the exe file in a Windows command prompt, I measure computation times of between 500 and 3000 microseconds. (On a 2.2 GHz Intel Core 2 Duo with 2 GB RAM. I measure the time by reading the number of CPU clock ticks since reset and dividing by the CPU frequency; a sketch of that technique follows below.)
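That tick-counting technique, as a minimal sketch, assuming GCC/MinGW on x86 (__rdtsc from <x86intrin.h>; the 2.2 GHz figure is the clock rate of the machine above, which the code cannot discover by itself):

#include <cstdio>
#include <x86intrin.h>

int main() {
    const double cpu_hz = 2.2e9;          // assumed clock rate of this machine

    unsigned long long t0 = __rdtsc();    // CPU clock ticks since reset
    // ... computation to be timed goes here ...
    unsigned long long t1 = __rdtsc();

    std::printf("%f microseconds\n", (t1 - t0) / cpu_hz * 1e6);
}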
I tried the same on another computer (including building the project), also 2.2 GHz but with 16 GB RAM and Windows 7. The program runs a bit faster, but the pattern is the same. The same is true when I run the project from NetBeans but in an external terminal (also a Windows terminal). And it applies to both versions, Debug and Release.
One of the computations (not the most time-critical) is a Fast Fourier Transform and depends on the fftw (http://www.fftw.org/) "libfftw3-3.dll" library. Within NetBeans its location is known; in the Windows command prompt, the DLL has to be in the same directory as the executable.
Why is it that a program runs so much faster in NetBeans' internal terminal window than in a Windows command prompt window?
Could it have something to do with the dynamic linking of the library at load time?
However, the computation step that actually takes longer in the Windows command prompt than in NetBeans' terminal does not depend on the functions in the DLL: it is the multiplication of two complex numbers.
The most time-consuming and crucial computation in the program is the multiplication of two arrays of complex numbers of type fftw_complex, which is double[2]:
for(int i = 0; i < n; ++i) {
    // Three real multiplications per element instead of four.
    double k1 = x[i][0] * (y[i][0] - y[i][1]);
    double k2 = (y[i][1] * -1) * (x[i][0] + x[i][1]);
    double k3 = y[i][0] * (x[i][1] - x[i][0]);
    // With these signs the result works out to x[i] * conj(y[i]):
    // real = x0*y0 + x1*y1, imag = x1*y0 - x0*y1.
    result[i][0] = k1 - k2;
    result[i][1] = k1 + k3;
}
x and y are two arrays of complex numbers where [i][0] holds the real part and [i][1] holds the imaginary part. result is a preallocated array of the same type.
This loop takes about 5 microseconds for arrays of length 513 when the program is executed in NetBeans' internal terminal, but closer to 100 microseconds when I run the program in a Windows command prompt. I could not find an explanation in all the really helpful advice in the answer by sehe.
Please let me know if you think it has something to do with the actual code; in that case I would provide some.
I have looked for similar questions but could not find any. Any hints or pointers to other questions and answers that I might have missed are appreciated.
Cheers.
As per my comment:
Do you do much console IO? Microsoft console windows are notoriously slow
Also look at the ENVIRONMENT variables, notably:
PATH
Defines search paths. The ordering of search locations may influence load times. To really get a handle on what is being done, consider enabling the fusion log:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Fusion]
"ForceLog"=dword:00000001
"LogFailures"=dword:00000001
"LogPath"="C:\\Temp\\Fusion"
(See How to enable assembly bind failure logging (Fusion) in .NET for important hints)
This will provide you with detailed information of what locations were searched, and in what order.
LANG
Might override the system locale (try setting the locale to C from your code)
For an explanation of the thinking behind the locale settings, see Unix sort command takes much longer depending on where it is executed?! (fastest from ProcessBuilder in program run from IDE, slowest from terminal)
To set the locale to C:
std::locale::global(std::locale("C"));
See also http://msdn.microsoft.com/en-us/library/9dzxxx2c(v=vs.80).aspx
HOME, TEMP
Might influence the location of temporary files

Building large application on Solaris is hanging without any information

I am trying to compile a rather big application on Solaris. Compiling it on AIX caused a problem where the command-line buffer was too small (ARG_MAX).
On Solaris it compiles most of the application successfully, but then it just hangs, without any error, and does nothing for at least an hour.
I am running it on SunOS 5.10 Sparc 32 bit.
Any ideas on how to find out what's going on or what might be causing such behavior?
I can't tell if the compilation is hanging, or your app itself.
If the app is hanging just follow the usual debugging steps: Either run it in your debugger and watch when it dies, or add print statements.
If the compiler dies, does it always die on the same file? If you compile that file by itself, does it still hang? If so, try trussing the compiler when you try to build the file that hangs. You may find that it's blocking on I/O, waiting for some nonexistent file or something similar.
What you may have to do is:
1. Comment out or delete 99% of the code and compile that.
2. Add around 5% of the code back in and compile that.
3. If the last thing you added caused the hour-long hang, split it up.
4. Go back to step 2.
Just for those who encounter this in the future:
The problem was that the optimization flag caused compilation to take a really long time. I am talking more than an hour for one .cpp file.
This is a big project.
In addition, there was an issue with the sysadmin on the Sun box not giving me enough CPU share.
Increasing that share solved the problem, or at least made compilation quicker and brought it within reasonable time bounds.
I hope this helps

Linux time sample based profiler

short version:
Is there a good time based sampling profiler for Linux?
long version:
I generally use OProfile to optimize my applications. I recently found a shortcoming that has me wondering.
The problem was a tight loop spawning c++filt to demangle a C++ name. I only stumbled upon the code by accident while chasing down another bottleneck. OProfile didn't show anything unusual about the code, so I almost ignored it, but my code sense told me to optimize the call and see what happened. I changed the popen of c++filt to abi::__cxa_demangle. The runtime went from more than a minute to a little over a second: about a 60x speed-up.
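For anyone making the same change, a minimal sketch of the abi::__cxa_demangle call (the mangled name here is just an example); note that the returned buffer is malloc'd and must be freed by the caller:

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int status = 0;
    // Returns a malloc'd buffer with the demangled name, or NULL on failure.
    char *name = abi::__cxa_demangle("_ZNSt6vectorIiSaIiEE9push_backERKi",
                                     NULL, NULL, &status);
    if (status == 0 && name)
        std::printf("%s\n", name);  // std::vector<int, ...>::push_back(int const&)
    std::free(name);
}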
Is there a way I could have configured OProfile to flag the popen call? As the profile data sits now, OProfile thinks the bottleneck was the heap and std::string calls (which, by the way, once optimized dropped the runtime to less than a second, more than a 2x speed-up).
Here is my OProfile configuration:
$ sudo opcontrol --status
Daemon not running
Event 0: CPU_CLK_UNHALTED:90000:0:1:1
Separate options: library
vmlinux file: none
Image filter: /path/to/executable
Call-graph depth: 7
Buffer size: 65536
Is there another profiler for Linux that could have found the bottleneck?
I suspect the issue is that OProfile only attributes samples to the process that is currently running on a CPU. I'd like it to always attribute samples to the process I'm profiling, so that if the process is currently switched out (blocking on I/O or a popen call), OProfile would just place its sample at the blocked call.
If I can't fix this, OProfile will only be useful when the executable is pushing near 100% CPU. It can't help with executables that have inefficient blocking calls.
Glad you asked. I believe OProfile can be made to do what I consider the right thing, which is to take stack samples on wall-clock time when the program is being slow and, if it won't let you examine individual stack samples, at least summarize, for each line of code that appears in the samples, the percentage of samples that line appears in. That is a direct measure of what would be saved if that line were not there. Here's one discussion. Here's another, and another. And, as Paul said, Zoom should do it.
If your time went from 60 sec to 1 sec, that implies every single stack sample would have had a 59/60 probability of showing you the problem.
Try Zoom - I believe it will let you profile all processes - it would be interesting to know if it highlights your problem in this case.
I wrote this a long time ago, only because I couldn't find anything better: https://github.com/dicej/profile
I just found this, too, though I haven't tried it: https://github.com/oliver/ptrace-sampler
A quickly hacked-up, trivial sampling profiler for Linux: http://vi-server.org/vi/simple_sampling_profiler.html
It appends a backtrace(3) dump to a file on SIGUSR1, and then converts it to annotated source.
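The core of that approach fits in a few lines. Here is a minimal, self-contained sketch of the same idea (not the tool's actual source); build with -g -rdynamic so the traces contain function names:

#include <csignal>
#include <execinfo.h>
#include <unistd.h>

static void dump_stack(int) {
    void *frames[64];
    int n = backtrace(frames, 64);
    // backtrace_symbols_fd writes straight to the fd without calling malloc,
    // so it is safe to use inside a signal handler.
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

int main() {
    std::signal(SIGUSR1, dump_stack);  // sample from outside: kill -USR1 <pid>
    for (;;)
        pause();                       // stands in for the program's real work
}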
After trying everything suggested here (except for the now-defunct Zoom, which is still available as a huge file from Dropbox), I found that NOTHING does what Mr. Dunlavey recommends. The "quick hacks" listed above in some of the answers wouldn't build for me, or didn't work for me either. I spent all day trying stuff... and nothing could find fseek as a hotspot in an otherwise simple test program that was I/O-bound.
So I coded up yet another profiler, this time with no build dependencies, based on GDB, so it should "just work" for almost any debuggable code. A single CPP file.
https://github.com/jasonrohrer/wallClockProfiler
It automates the manual process suggested by Mr. Dunlavey, interrupting the target process with GDB periodically and harvesting a stack trace, and then printing a report at the end about which stack traces are the most common. Those are your true wall-clock hotspots. And it actually works.