The value given by cpu_time does not change over little time - fortran

I want to check how much time does it take the computer to compute a function. To do this I wanted to compare the values given by the cpu_time subroutine before and after calling my function. To my surprise, the value before and after was the same as if it took zero time to perform the function. To check it, I created a simple piece of code
call cpu_time(time)
write(*,*) time
do i=1,10000
!simple math equation here
end do
call cpu_time(time)
write(*,*) time
And after running the program the value printed before and after the loop was exaclty the same. My guess is that the system clock is not precise enough to dinstinguish such a little changes, but does it really make sense? All in all I don't know how to measure the time needed for executing my function without this working properly.

Related

Fortran execution time

I am new with Fortran and I would like to ask for help. My code is very simple. It just enters a loop and then using system intrinsic procedure enters the file with the name code and runs the evalcode.x program.
program subr1
implicit none
integer :: i,
real :: T1,T2
call cpu_time(T1)
do i=1,6320
call system ("cd ~/code; ../evalcede/source/evalcode.x test ")
enddo
call cpu_time(T2)
print *, T1,T2
end program subr1
The time measured that the program is actually running is 0.5 sec, but time that this code actually needs for execution is 1.5 hours! The program is suspended or waiting and I do not know why.
note: this is more an elaborated comment to the post of Janneb to provide a bit more information.
As indicated by Janneb, the function CPU_TIME does not necesarily return wall-clock time, what you are after. This especially when timing system calls.
Furthermore, the output of CPU_TIME is really a processor and compiler dependent value. To demonstrate this, the following code is compiled with gfortran, ifort and solaris-studio f90:
program test_cpu_time
real :: T1,T2
call cpu_time(T1)
call execute_command_line("sleep 5")
call cpu_time(T2)
print *, T1,T2, T2-T1
end program test_cpu_time
#gfortran>] 1.68200000E-03 1.79799995E-03 1.15999952E-04
#ifort >] 1.1980000E-03 1.3410000E-03 1.4299992E-04
#f90 >] 0.0E+0 5.00534 5.00534
Here, you see that both gfortran and ifort exclude the time of the system-command while solaris-studio includes the time.
In general, one should see the difference between the output of two consecutive calls to CPU_TIME as the time spend by the CPU to perform the actions. Due to the system call, the process is actually in a sleep state during the time of execution and thus no CPU time is spent. This can be seen by a simple ps:
$ ps -O ppid,nlwp,psr,stat $(pgrep sleep) $(pgrep a.out)
PID PPID NLWP PSR STAT S TTY TIME COMMAND
27677 17146 1 2 SN+ S pts/40 00:00:00 ./a.out
27678 27677 1 1 SN+ S pts/40 00:00:00 sleep 5
NLWP indicates how many threads in use
PPID indicates parent PID
STAT indicates 'S' for interruptible sleep (waiting for an event to complete)
PSR is the cpu/thread it is running on.
You notice that the main program a.out is in a sleep state and both the system call and the main program are running on separate cores. Since the main program is in a sleep state, the CPU_TIME will not clock this time.
note: solaris-studio is the odd duck, but then again, it's solaris studio!
General comment: CPU_TIME is still useful for determining the execution time of segments of code. It is not useful for timing external programs. Other more dedicated tools exist for this such as time: The OP's program could be reduced to the bash command:
$ time ( for i in $(seq 1 6320); do blabla; done )
This is what the standard has to say on CPU_TIME(TIME)
CPU_TIME(TIME)
Description: Return the processor time.
Note:13.9: A processor for which a single result is inadequate (for example, a parallel processor) might choose to
provide an additional version for which time is an array.
The exact definition of time is left imprecise because of the variability in what different processors are able
to provide. The primary purpose is to compare different algorithms on the same processor or discover which
parts of a calculation are the most expensive.
The start time is left imprecise because the purpose is to time sections of code, as in the example.
Most computer systems have multiple concepts of time. One common concept is that of time expended by
the processor for a given program. This might or might not include system overhead, and has no obvious
connection to elapsed “wall clock” time.
source: Fortran 2008 Standard, Section 13.7.42
On top of that:
It is processor dependent whether the results returned from CPU_TIME, DATE_AND_TIME and SYSTEM_CLOCK are dependent on which image calls them.
Note 13.8: For example, it is unspecified whether CPU_TIME returns a per-image or per-program value, whether all
images run in the same time zone, and whether the initial count, count rate, and maximum in SYSTEM_CLOCK are the same for all images.
source: Fortran 2008 Standard, Section 13.5
The CPU_TIME intrinsic measures CPU time consumed by the program itself, not including those of it's subprocesses (1).
Apparently most of the time is spent in evalcode.x which explains why the reported wallclock time is much higher.
If you want to measure wallclock time intervals in Fortran, you can use the SYSTEM_CLOCK intrinsic.
(1) Well, that's what GFortran does, at least. The standard doesn't specify exactly what it means.

FFTW in Fortran result contains only zeros

I have been trying to write a simple program to perform an fft on a 1D input array using fftw3. Here I am using a seismogram as an input. The output array is, however, coming out to contain only zeroes.
I know that the input is correct as I have tried doing the fft of the same input file in MATLAB as well, which gives correct results. There is no compilation error. I am using f95 to compile this, however, gfortran was also giving pretty much the same results. Here is the code that I wrote:-
program fft
use functions
implicit none
include 'fftw3.f90'
integer nl,row,col
double precision, allocatable :: data(:,:),time(:),amplitude(:)
double complex, allocatable :: out(:)
integer*8 plan
open(1,file='test-seismogram.xy')
nl=nlines(1,'test-seismogram.xy')
allocate(data(nl,2))
allocate(time(nl))
allocate(amplitude(nl))
allocate(out(nl/2+1))
do row = 1,nl
read(1,*,end=101) data(row,1),data(row,2)
amplitude(row)=data(row,2)
end do
101 close(1)
call dfftw_plan_dft_r2c_1d(plan,nl,amplitude,out,FFTW_R2HC,FFTW_PATIENT)
call dfftw_execute_dft_r2c(plan, amplitude, out)
call dfftw_destroy_plan(plan)
do row=1,(nl/2+1)
print *,out(row)
end do
deallocate(data)
deallocate(amplitude)
deallocate(time)
deallocate(out)
end program fft
The nlines() function is a function which is used to calculate the number of lines in a file, and it works correctly. It is defined in the module called functions.
This program pretty much tries to follow the example at http://www.fftw.org/fftw3_doc/Fortran-Examples.html
There might just be a very simple logical error that I am making, but I am seriously unable to figure out what is going wrong here. Any pointers would be very helpful.
This is pretty much how the whole output looks like:-
.
.
.
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
(0.0000000000000000,0.0000000000000000)
.
.
.
My doubt is directly regarding fftw, since there is a tag for fftw on SO, so I hope this question is not off topic
As explained in the comments first by #roygvib and #Ross, the plan subroutines overwrite the input arrays because they try the transform many times with different parameters. I will add some practical use considerations.
You claim you do care about performance. Then there are two possibilities:
You do the transform only once as you show in your code. Then there is no point to use FFTW_MEASURE. The planning subroutine is many times slower than actual plan execute subroutine. Use FFTW_ESTIMATE and it will be much faster.
FFTW_MEASURE tells FFTW to find an optimized plan by actually
computing several FFTs and measuring their execution time. Depending
on your machine, this can take some time (often a few seconds).
FFTW_MEASURE is the default planning option.
FFTW_ESTIMATE specifies that, instead of actual measurements of
different algorithms, a simple heuristic is used to pick a (probably
sub-optimal) plan quickly. With this flag, the input/output arrays are
not overwritten during planning.
http://www.fftw.org/fftw3_doc/Planner-Flags.html
You do the same transform many times for different data. Then you must do the planning only once before the first transform and than re-use the plan. Just make the plan first and only then you fill the array with the first input data. Making the plan before every transport would make the program extremely slow.

No speedup with OpenMP in DO loop [duplicate]

I have a Fortran 90 program calling a multi threaded routine. I would like to time this program from the calling routine. If I use cpu_time(), I end up getting the cpu_time for all the threads (8 in my case) added together and not the actual time it takes for the program to run. The etime() routine seems to do the same. Any idea on how I can time this program (without using a stopwatch)?
Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.
If this is a one-off thing, then I agree with larsmans, that using gprof or some other profiling is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat stuff that's output every single tiem you run your code.
Jeremia Wilcock pointing out omp_get_wtime() is very useful; it's standards compliant so should work on any OpenMP compiler - but it only has second resolution, which may or may not be enough, depending on what you're doing. Edited; the above was completely wrong.
Fortran90 defines system_clock() which can also be used on any standards-compliant compiler; the standard doesn't specify a time resolution, but gfortran it seems to be milliseconds and ifort seems to be microseconds. I usually use it in something like this:
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime

Interpreting GPerfTools sample count

I'm struggling a little with reading the textual output the GPerfTools generate. I think part of the problem is that I don't fully understand how the sampling method operates.
From Wikipedia I gather that profilers based on sample functions usually work by sending an interrupt to the OS and querying the program's current instruction pointer. Now my knowledge about assembly is a little rusty, so I'm wondering what it means if the instruction pointer points to method m at any given time? I.e. does it mean that the function is about to be called or does it mean it's currently executed, or both?
There's a difference if I'm not mistaken, because in the first case the sample count (i.e. times m is seen while taking a sample) translates to the absolute call count of m, while in the latter case it simply translates to times seen, i.e. a mere indication of relative time spent in this method.
Can someone clarify?

Memory scanner with a slow scan

I'm working on a Memory Scanner, but the scan is so slow.. can anybody help-me improve it?
procedure FirstScan(scantype, scanvalue: string);
var
value :integer;
dwEndAddr : dword;
i:dword;
mbi : TMemoryBasicInformation;
begin
while (VirtualQuery(Pointer(DWORD(mbi.BaseAddress) + MBI.RegionSize), MBI, SizeOf(MEMORY_BASIC_INFORMATION))=SizeOf(TMemoryBasicInformation)) do begin
if (MBI.State = MEM_COMMIT) and (MBI.Protect = PAGE_READWRITE) then begin
dwEndAddr := DWORD(mbi.BaseAddress) + MBI.RegionSize;
for i := DWORD(MBI.BaseAddress) to (dwEndAddr - 1 - sizeof(DWORD)) do begin
Application.ProcessMessages;
try
if scantype = '1 Byte' then begin
value := PBYTE(i)^;
if scanvalue = IntToStr(value) then ListBox1.Items.Add(IntToHex(i,8));
end;
//others scantypes here...
except
Break;
end;
end;
end;
end;
end;
I've learned that I need to read 4096 byte pages at a time then store those in memory and do operations on it, until I need a new page then get another 4096 byte page...
But I don't know how can I do that...
Can anybody help-me? The code can be in C or C++...
I can help you a bit...get that Application.ProcessMessages out of the inner loop. You're calling it for every. single. byte. you. scan. You don't need to be quite that responsive to window messages. :)
Move it up into the outer loop, and you should see a significant speed increase. I'd say spawn a thread and get rid of Application.ProcessMessages altogether, as it's really not this code's job to handle, but i'm not sure how/whether Delphi does threads.
Also....you're passing the scan params as strings? If you insist on doing that, set an int or enum or something before you start the loop that says what scan type to use, convert the value to a useful type for your search, and compare that. String comparisons tend to be slower than integer comparisons, particularly when you're creating new strings each time.
To make slow code fast, there are a few things you can do. First, make sure your code is correct. Wrong results are still wrong results, even if you get them quickly. To that end, make sure that when you call VirtualQuery, you're passing in valid values for all the parameters. At that start of this function mbi is uninitialized, so the result of DWORD(mbi.BaseAddress) + MBI.RegionSize will be who-knows-what.
Once you have correctly working code, there are two ways to make it faster:
Find the slow parts and make them fast. To do this right, you need a profiler. A profiler will observe your program as it runs, and then tell you what percentage of time your program spent executing each part. That tells you where to focus your efforts.
Replace slow algorithms with faster algorithms. This might mean throwing away the entire function, or it might mean fixing just certain parts of the code.
For example, profiling might show that you spend a lot of time call ProcessMessages. You can't really make that function any faster since it's part of the VCL, but you can call it less often. You might even find that you don't need to call it at all, if the thread you're running this code on isn't expected to receive any messages that need processing.
Profiling might show that you're spending a lot of time doing string comparisons. If the starts of your strings are frequently equal, and usually only differ at the end, then you might wish to change your string-comparison algorithm to start comparing strings at the last character instead of the first.
Profiling might show that you're spending a lot of time converting integers into strings before you compare them. Most programming languages support comparing integers directly, so instead of using a string-comparing algorithm, you could try using an integer-comparing algorithm instead. You could convert scanvalue to an integer with StrToInt(scanvalue) and compare it directly to value.
Profiling might show that you're repeatedly calculating the same result from the same input. If a value doesn't change over some portion of a program, then values calculated from it won't change, either. You can reduce the cost of converting values by doing it only when a value has changed. For example, if you do integer comparisons, then you'll probably find that the integer version of scanvalue doesn't change in your function. You could convert scanvalue to an integer once at the start of the function, and then compare value to that inside the loop instead of calling StrToInt(scanvalue) lots of times.
Also: you're converting the current pointer to a string for every byte you access. Yech, horribly slow. Instead, change scanvalue to a BYTE at the start of your procedure and do the comparison directly on BYTEs.
Finally -- pull the "if scantype = '1 byte'" out of the for i:=DWORD(MBI.BaseAddress) loop. You don't want to be doing that "if" statement for every byte -- instead, do
if scantype = '1 byte' then
for i:= DWORD(...)
else if scantype='other scan type' then...
and so on. (And yes, you should convert that "if scantype" comparison to an enum or whatnot.)