Timing in Fortran Mex file (by using automatic parallelization)

I am currently trying to compare the runtimes of two Fortran subroutines. To make the codes easier to call, I have written Matlab MEX files for them. The first thing I did was to measure the time of an individual call of each routine (within the MEX file):
CALL DTIME( TARRAY, TIME )
CALL MY_PROGRAM1( ... )
CALL DTIME( TARRAY, TIME )
and
CALL DTIME( TARRAY, TIME )
CALL MY_PROGRAM2( ... )
CALL DTIME( TARRAY, TIME )
which gives 1.53s for Program 1 and 0.93s for Program 2.
Now, in order to also perform timings for smaller problems where the resolution of DTIME is not good enough, I put the calls from above into a loop to solve the problem, say 10 times:
CALL DTIME( TARRAY, TIME )
DO 10 K = 1, 10
CALL MY_PROGRAM1/2( ... )
10 CONTINUE
CALL DTIME( TARRAY, TIME )
However, now I get 9.23s for Program 1 (I would expect about 15.3s) and 9.16s for Program 2, so the ratio of the timings is completely different from the single calls above.
I have a 64-bit Ubuntu machine with 4 cores, so I suppose there might be some automatic parallelization in the DO loop for Program 1. But this does not seem to happen for Program 2, even though I used the same options for mexing. Does anyone have an idea what the problem could be and how to solve it (maybe by preventing automatic parallelization just in the loop above)? Many thanks in advance!
Matthias

Related

Fortran execution time

I am new to Fortran and I would like to ask for help. My code is very simple. It just enters a loop and then, using the system intrinsic procedure, changes into the directory named code and runs the evalcode.x program.
program subr1
implicit none
integer :: i
real :: T1,T2
call cpu_time(T1)
do i=1,6320
call system ("cd ~/code; ../evalcede/source/evalcode.x test ")
enddo
call cpu_time(T2)
print *, T1,T2
end program subr1
The measured running time of the program is 0.5 sec, but the time this code actually needs for execution is 1.5 hours! The program is suspended or waiting and I do not know why.

note: this is more of an elaborated comment on Janneb's post, to provide a bit more information.
As indicated by Janneb, the function CPU_TIME does not necessarily return wall-clock time, which is what you are after. This matters especially when timing system calls.
Furthermore, the output of CPU_TIME is really a processor and compiler dependent value. To demonstrate this, the following code is compiled with gfortran, ifort and solaris-studio f90:
program test_cpu_time
real :: T1,T2
call cpu_time(T1)
call execute_command_line("sleep 5")
call cpu_time(T2)
print *, T1,T2, T2-T1
end program test_cpu_time
#gfortran>] 1.68200000E-03 1.79799995E-03 1.15999952E-04
#ifort >] 1.1980000E-03 1.3410000E-03 1.4299992E-04
#f90 >] 0.0E+0 5.00534 5.00534
Here, you see that both gfortran and ifort exclude the time of the system-command while solaris-studio includes the time.
In general, one should read the difference between the outputs of two consecutive calls to CPU_TIME as the time spent by the CPU to perform the actions in between. Due to the system call, the process is actually in a sleep state during the time of execution and thus no CPU time is spent. This can be seen with a simple ps:
$ ps -O ppid,nlwp,psr,stat $(pgrep sleep) $(pgrep a.out)
PID PPID NLWP PSR STAT S TTY TIME COMMAND
27677 17146 1 2 SN+ S pts/40 00:00:00 ./a.out
27678 27677 1 1 SN+ S pts/40 00:00:00 sleep 5
NLWP indicates how many threads are in use
PPID indicates parent PID
STAT indicates 'S' for interruptible sleep (waiting for an event to complete)
PSR is the cpu/thread it is running on.
You notice that the main program a.out is in a sleep state and both the system call and the main program are running on separate cores. Since the main program is in a sleep state, the CPU_TIME will not clock this time.
note: solaris-studio is the odd duck, but then again, it's solaris studio!
General comment: CPU_TIME is still useful for determining the execution time of segments of code. It is not useful for timing external programs. Other, more dedicated tools exist for this, such as time. The OP's program could be reduced to the bash command:
$ time ( for i in $(seq 1 6320); do blabla; done )
This is what the standard has to say on CPU_TIME(TIME)
CPU_TIME(TIME)
Description: Return the processor time.
Note 13.9: A processor for which a single result is inadequate (for example, a parallel processor) might choose to
provide an additional version for which time is an array.
The exact definition of time is left imprecise because of the variability in what different processors are able
to provide. The primary purpose is to compare different algorithms on the same processor or discover which
parts of a calculation are the most expensive.
The start time is left imprecise because the purpose is to time sections of code, as in the example.
Most computer systems have multiple concepts of time. One common concept is that of time expended by
the processor for a given program. This might or might not include system overhead, and has no obvious
connection to elapsed “wall clock” time.
source: Fortran 2008 Standard, Section 13.7.42
On top of that:
It is processor dependent whether the results returned from CPU_TIME, DATE_AND_TIME and SYSTEM_CLOCK are dependent on which image calls them.
Note 13.8: For example, it is unspecified whether CPU_TIME returns a per-image or per-program value, whether all
images run in the same time zone, and whether the initial count, count rate, and maximum in SYSTEM_CLOCK are the same for all images.
source: Fortran 2008 Standard, Section 13.5

The CPU_TIME intrinsic measures CPU time consumed by the program itself, not including that of its subprocesses (1).
Apparently most of the time is spent in evalcode.x which explains why the reported wallclock time is much higher.
If you want to measure wallclock time intervals in Fortran, you can use the SYSTEM_CLOCK intrinsic.
(1) Well, that's what GFortran does, at least. The standard doesn't specify exactly what it means.
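To make the CPU-time versus wall-clock distinction concrete, here is a minimal sketch that measures both around one external command. The module name timing_demo and the subroutine time_command are made up for this illustration; only CPU_TIME, SYSTEM_CLOCK and EXECUTE_COMMAND_LINE are standard intrinsics.

```fortran
module timing_demo
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Hypothetical helper: run a shell command and report both the CPU time
  ! charged to this process and the wall-clock time that passed meanwhile.
  subroutine time_command(cmd, cpu_s, wall_s)
    character(len=*), intent(in)  :: cmd
    real(real64),     intent(out) :: cpu_s, wall_s
    real(real64)   :: c0, c1
    integer(int64) :: n0, n1, rate

    call cpu_time(c0)
    call system_clock(n0, rate)      ! clock count and counts per second
    call execute_command_line(cmd)   ! waits until the command has finished
    call system_clock(n1)
    call cpu_time(c1)

    cpu_s  = c1 - c0
    wall_s = real(n1 - n0, real64) / real(rate, real64)
  end subroutine time_command
end module timing_demo
```

Calling time_command("sleep 5", cpu, wall) from a small driver should give wall close to 5 seconds. With gfortran, cpu should stay near zero, in line with the gfortran output quoted in the other answer. Other compilers may count the time differently.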

Running multiple while true loops independently in python

Essentially I have 2 "while True:" loops in my code. Both of the loops are right at the end. However when I run the code, only the first while True: loop gets run, and the second one gets ignored.
For example:
while True:
    print "hi"
while True:
    print "bye"
Here, it will continuously print hi, but won't print bye at all (the actual code has a tracer.execute() for one loop, and the other is listening to a port, and they both work on their own).
Is there any way to get both loops to work at the same time independently?

Yes, there is a way to get both loops to work at the same time independently.
Your initial surprise comes from how Finite-State-Automata actually work:
[0]: any-processing-will-always-<START>-here
[1]: Read a next instruction
[2]: Execute the instruction
[3]: GO TO [1]
The stream of abstract instructions is being executed in a pure-[SERIAL] manner, one after another. There is no other way in the CPU since uncle Turing.
Your desire to have more streams-of-instructions run at the same time independently is called [CONCURRENT] process-scheduling.
You have several tools for achieving a wanted modus-operandi:
Read about a weaker form, thread-based concurrency. Due to the Python-specific GIL, the threads are [CONCURRENT], but the GIL interleaves them so that only one instruction stream executes at a time. This was knowingly implemented as a very cheap form of collision avoidance: it prevents any two [CONCURRENT] streams from accessing a Python object at the same time. If you are fine with this execute-just-one-instruction-stream-fragment-at-a-time model (with a round-robin order of GIL-stepped execution), you can live in a safe and collision-free world.
Another tool Python may use is joblib.Parallel()( joblib.delayed() ), where you will have to master a bit more. Here the work runs in a set of fully spawned subprocesses, each (yes, each) having a full copy of the Python state plus all variables (read: a lot of time and memory needed to spawn them), with no mutual coordination.
So decide which form is just enough for your use case, and check the new re-formulation of Amdahl's Law carefully (it covers the costs of going distributed or parallel).

The value given by cpu_time does not change over little time

I want to check how much time it takes the computer to compute a function. To do this I wanted to compare the values given by the cpu_time subroutine before and after calling my function. To my surprise, the value before and after was the same, as if it took zero time to perform the function. To check it, I created a simple piece of code
call cpu_time(time)
write(*,*) time
do i=1,10000
!simple math equation here
end do
call cpu_time(time)
write(*,*) time
And after running the program the value printed before and after the loop was exactly the same. My guess is that the system clock is not precise enough to distinguish such small changes, but does it really make sense? All in all, I don't know how to measure the time needed to execute my function without this working properly.

No speedup with OpenMP in DO loop [duplicate]

I have a Fortran 90 program calling a multi threaded routine. I would like to time this program from the calling routine. If I use cpu_time(), I end up getting the cpu_time for all the threads (8 in my case) added together and not the actual time it takes for the program to run. The etime() routine seems to do the same. Any idea on how I can time this program (without using a stopwatch)?

Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.

If this is a one-off thing, then I agree with larsmans that using gprof or some other profiler is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat stuff that's output every single time you run your code.
Jeremia Wilcock's pointer to omp_get_wtime() is very useful; it's standards compliant so should work on any OpenMP compiler. Edited: I originally claimed here that it only has second resolution; that was completely wrong.
Fortran90 defines system_clock(), which can also be used on any standards-compliant compiler; the standard doesn't specify a time resolution, but with gfortran it seems to be milliseconds and with ifort it seems to be microseconds. I usually use it in something like this:
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime
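As a variation on the tick/tock pair above, the sketch below uses 64-bit integers for the clock counts. The module name stopwatch is made up for this illustration. That 64-bit arguments give a finer count rate is an assumption, based on what gfortran appears to do, not something the standard promises. With 64-bit counts, wraparound of the counter should also not be a practical concern.

```fortran
module stopwatch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Store the current clock count in t.
  subroutine tick(t)
    integer(int64), intent(out) :: t
    call system_clock(t)
  end subroutine tick

  ! Return the wall-clock seconds elapsed since the count stored in t.
  function tock(t) result(seconds)
    integer(int64), intent(in) :: t
    real(real64)   :: seconds
    integer(int64) :: now, clock_rate
    call system_clock(now, clock_rate)
    seconds = real(now - t, real64) / real(clock_rate, real64)
  end function tock
end module stopwatch
```

Usage is the same as before, except that the variable passed to tick and tock is declared integer(int64) and the result is real(real64). Because system_clock reports elapsed time, the result does not add up the CPU time of all threads the way cpu_time does in the question.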

Limit recursive calls in C++ (about 5000)?

In order to find the limit on recursive calls in C++, I tried this function:
#include <cstdio>
void recurse ( int count ) // Each call gets its own count
{
printf("%d\n",count );
// It is not necessary to increment count since each function's
// variables are separate (so each count will be initialized one greater)
recurse ( count + 1 );
}
This program halts when count equals 4716, so the limit is just 4716!
I'm a little bit confused: why does the program stop execution when count equals 4716?
PS: Executed under Visual studio 2010.
thanks

The limit of recursive calls depends on the size of the stack. The C++ language does not limit this (from memory, there is a lower limit on how many function calls a standards-conforming compiler needs to support, and it's a pretty small value).
And yes, recursing "infinitely" will stop at some point or another. I'm not entirely sure what else you expect.
It is worth noting that designing software to do "boundless" recursion (or recursion that runs into the hundreds or thousands) is a very bad idea. There is no (standard) way to find out the limit of the stack, and you can't recover from a stack overflow crash.
You will also find that if you add an array or some other data structure [and use it, so it doesn't get optimized out], the recursion limit goes lower, because each stack-frame uses more space on the stack.
Edit: I actually would expect a higher limit, I suspect you are compiling your code in debug mode. If you compile it in release mode, I expect you get several thousand more, possibly even endless, because the compiler converts your tail-recursion into a loop.

The stack size is dependent on your environment.
In *NIX for instance, you can modify the stack size in the environment, then run your program and the result will be different.
In Windows, you can change it this way (source):
$ editbin /STACK:reserve[,commit] program.exe

You've probably run out of stack space.
Every time you call the recursive function, it needs to push a return address on the stack so it knows where to return to after the function call.
It crashes at 4716 because it just happens to run out of stack space after about 4716 iterations.
