Inconsistency when profiling my code with gprof - fortran

I am using a relatively simple code parallelized with OpenMP to familiarize myself with gprof.
My code mainly consists of gathering data from input files, performing some array manipulations and writing the new data to different output files. I placed some calls to the intrinsic subroutine CPU_TIME to see if gprof was being accurate:
PROGRAM main
USE global_variables
USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
USE change_vars
IMPLICIT NONE
REAL(dp) :: t0, t1
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL allocate_data
CALL CPU_TIME(t1)
PRINT*, "Allocate data =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL build_grid
CALL CPU_TIME(t1)
PRINT*, "Build grid =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL read_old_restart
CALL CPU_TIME(t1)
PRINT*, "Read restart =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL regroup_all
CALL CPU_TIME(t1)
PRINT*, "Regroup all =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL redistribute_all
CALL CPU_TIME(t1)
PRINT*, "Redistribute =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL write_new_restart
CALL CPU_TIME(t1)
PRINT*, "Write restart =", t1 - t0
END PROGRAM main
Here is the output:
Allocate data = 1.000000000000000E-003
Build grid = 0.000000000000000E+000
Read restart = 10.7963590000000
Regroup all = 6.65998700000000
Redistribute = 14.3518180000000
Write restart = 53.5218640000000
Therefore, the write_new_restart subroutine is the most time consuming and takes about 62% of the total run time. However, according to gprof, the subroutine redistribute_vars, which is called multiple times by redistribute_all, is the most time consuming with about 74% of the total time. Here is the output from gprof:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 74.40      8.95     8.95       61     0.15     0.15  change_vars_mp_redistribute_vars_
 19.12     11.25     2.30       60     0.04     0.04  change_vars_mp_regroup_vars_
  6.23     12.00     0.75       63     0.01     0.01  change_vars_mp_fill_last_blocks_
  0.08     12.01     0.01        1     0.01     2.31  change_vars_mp_regroup_all_
  0.08     12.02     0.01                             __intel_ssse3_rep_memcpy
  0.08     12.03     0.01                             for_open
  0.00     12.03     0.00        1     0.00    12.01  MAIN__
  0.00     12.03     0.00        1     0.00     0.00  change_vars_mp_build_grid_
  0.00     12.03     0.00        1     0.00     9.70  change_vars_mp_redistribute_all_
  0.00     12.03     0.00        1     0.00     0.00  fileio_mp_read_old_restart_
  0.00     12.03     0.00        1     0.00     0.00  fileio_mp_write_new_restart_
  0.00     12.03     0.00        1     0.00     0.00  global_variables_mp_allocate_data_
index % time self children called name
0.00 12.01 1/1 main [2]
[1] 99.8 0.00 12.01 1 MAIN__ [1]
0.00 9.70 1/1 change_vars_mp_redistribute_all_ [3]
0.01 2.30 1/1 change_vars_mp_regroup_all_ [5]
0.00 0.00 1/1 global_variables_mp_allocate_data_ [13]
0.00 0.00 1/1 change_vars_mp_build_grid_ [10]
0.00 0.00 1/1 fileio_mp_read_old_restart_ [11]
0.00 0.00 1/1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
<spontaneous>
[2] 99.8 0.00 12.01 main [2]
0.00 12.01 1/1 MAIN__ [1]
-----------------------------------------------
0.00 9.70 1/1 MAIN__ [1]
[3] 80.6 0.00 9.70 1 change_vars_mp_redistribute_all_ [3]
8.95 0.00 61/61 change_vars_mp_redistribute_vars_ [4]
0.75 0.00 63/63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
8.95 0.00 61/61 change_vars_mp_redistribute_all_ [3]
[4] 74.4 8.95 0.00 61 change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
0.01 2.30 1/1 MAIN__ [1]
[5] 19.2 0.01 2.30 1 change_vars_mp_regroup_all_ [5]
2.30 0.00 60/60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
2.30 0.00 60/60 change_vars_mp_regroup_all_ [5]
[6] 19.1 2.30 0.00 60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
0.75 0.00 63/63 change_vars_mp_redistribute_all_ [3]
[7] 6.2 0.75 0.00 63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
<spontaneous>
[8] 0.1 0.01 0.00 for_open [8]
-----------------------------------------------
<spontaneous>
[9] 0.1 0.01 0.00 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[10] 0.0 0.00 0.00 1 change_vars_mp_build_grid_ [10]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[11] 0.0 0.00 0.00 1 fileio_mp_read_old_restart_ [11]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[12] 0.0 0.00 0.00 1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[13] 0.0 0.00 0.00 1 global_variables_mp_allocate_data_ [13]
-----------------------------------------------
For your information, regroup_all calls regroup_vars multiple times and redistribute_all calls redistribute_vars and fill_last_blocks multiple times.
I am compiling my code with ifort with the -openmp -O2 -pg options.
QUESTION:
Why is gprof not seeing the time my file i/o subroutines take? (read_old_restart, write_new_restart)

gprof specifically does not include I/O time. It only tries to measure CPU time.
That's because it only does two things: 1) sample the program counter on a 1/100 second clock, and the program counter is meaningless during I/O, and 2) count the number of times any function B is called by any function A.
From the call-counts, it tries to guess how much of each function's CPU time can be attributed to each caller.
That's its whole advance over pre-existing profilers.
When you use gprof, you should understand what it does and what it doesn't do.
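The distinction between CPU time and elapsed time can be illustrated with a minimal Python sketch. It is not the questioner's Fortran program: time.sleep is only a stand-in for a blocking read or write, and the helper names (timed, blocked, busy) are made up for this illustration.

```python
import time

def timed(fn):
    """Return (wall_seconds, cpu_seconds) spent inside fn()."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def blocked():
    # Stand-in for blocking I/O: the process waits and uses almost no CPU.
    time.sleep(0.2)

def busy():
    # Pure computation: CPU time and wall time grow together.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

for name, fn in (("blocked", blocked), ("busy", busy)):
    wall, cpu = timed(fn)
    print(f"{name:8s} wall={wall:.3f}s cpu={cpu:.3f}s")
```

A sampler that only ticks while the process is consuming CPU would attribute nearly all of its samples to busy and almost none to blocked, even though both take real time.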

Related

How did they convert IR-Lock pixels to a position on a normal plane 1m in front of the lens

I am working on tracking a moving target using a quadcopter. In my project, I am using an IR camera which is a modified version of the Pixy camera, but for detecting IR targets. While studying their code I found a part I couldn't understand. I tried googling it but didn't find any formula related to it, so I wonder if someone can give me some tips on what equations or formula they used.
Here is the part I didn't understand.
/*
converts IRLOCK pixels to a position on a normal plane 1m in front of the lens based
on a characterization of IR-LOCK with the standard lens, focused such that 2.38mm
of threads are exposed
*/
void AP_IRLock_I2C::pixel_to_1M_plane(float pix_x, float pix_y, float &ret_x, float &ret_y)
{
ret_x = (-0.00293875727162397f*pix_x + 0.470201163459835f)/
(4.43013552642296e-6f*((pix_x - 160.0f)*(pix_x - 160.0f))
+ 4.79331390531725e-6f*((pix_y - 100.0f)*(pix_y - 100.0f)) - 1.0f);
ret_y = (-0.003056843086277f*pix_y + 0.3056843086277f)/
(4.43013552642296e-6f*((pix_x - 160.0f)*(pix_x - 160.0f))
+ 4.79331390531725e-6f*((pix_y - 100.0f)*(pix_y - 100.0f)) - 1.0f);
}
You can find the rest of the code here.
IRlock Ardupilot
Let's focus on the equation for ret_x, which simplified is like this:
ret_x = (-0.0029 * pix_x + 0.47) /
(4.4e-6 * (pix_x - 160.0)^2 + 4.8e-6 * (pix_y - 100.0)^2 - 1.0);
First, notice the 160 and 100 magic numbers. The Pixy capture resolution is 320x200, so these are there to translate pixel coordinates from a space where (0,0) is in the corner to where it is in the center. So if pix_x is 160 and pix_y is 100, that is the center of the frame, and the denominator will be -1.
The rest of it appears to be a lens correction. Here are the values of ret_x that you get across the range of valid pix_x and pix_y inputs:
pix_y \ pix_x      0     40     80    120    160    200    240    280    320
            0  -0.56  -0.40  -0.25  -0.12   0.00   0.12   0.25   0.40   0.56
           20  -0.55  -0.39  -0.25  -0.12   0.00   0.12   0.25   0.39   0.55
           40  -0.54  -0.38  -0.25  -0.12   0.00   0.12   0.25   0.38   0.54
           60  -0.53  -0.38  -0.24  -0.12   0.00   0.12   0.24   0.38   0.53
           80  -0.53  -0.38  -0.24  -0.12   0.00   0.12   0.24   0.38   0.53
          100  -0.53  -0.38  -0.24  -0.12   0.00   0.12   0.24   0.38   0.53
          120  -0.53  -0.38  -0.24  -0.12   0.00   0.12   0.24   0.38   0.53
          140  -0.53  -0.38  -0.24  -0.12   0.00   0.12   0.24   0.38   0.53
          160  -0.54  -0.38  -0.25  -0.12   0.00   0.12   0.25   0.38   0.54
          180  -0.55  -0.39  -0.25  -0.12   0.00   0.12   0.25   0.39   0.55
          200  -0.56  -0.40  -0.25  -0.12   0.00   0.12   0.25   0.40   0.56
So as expected, ret_x is near 0 for pixels near the center (pix_x == 160). And it reaches +/- 0.56 at the extremes, which suggests a horizontal field of view of approximately 60 degrees (from trigonometry, a 2*0.56 meter width at 1 meter distance gives 2*atan(0.56), about 58.5 degrees).
The horizontal correction is slightly influenced by the vertical coordinate, notably near the corners. This is presumably to correct for spherical distortion in the lens (which is common).
The astute will recognize that the equation is slightly defective: given pixel coordinates in [0,319] and [0,199], the center values should be 159.5 and 99.5, not 160 and 100.
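To reproduce the numbers discussed above, here is a Python transcription of the function from the question. The coefficients are copied from the C++ source; the function name pixel_to_1m_plane and the field-of-view calculation are additions for illustration only.

```python
import math

def pixel_to_1m_plane(pix_x, pix_y):
    """Python transcription of AP_IRLock_I2C::pixel_to_1M_plane."""
    # Shared denominator: depends on the distance from the frame center (160, 100).
    denom = (4.43013552642296e-6 * (pix_x - 160.0) ** 2
             + 4.79331390531725e-6 * (pix_y - 100.0) ** 2 - 1.0)
    ret_x = (-0.00293875727162397 * pix_x + 0.470201163459835) / denom
    ret_y = (-0.003056843086277 * pix_y + 0.3056843086277) / denom
    return ret_x, ret_y

center = pixel_to_1m_plane(160, 100)
x_left, _ = pixel_to_1m_plane(0, 0)
x_right, _ = pixel_to_1m_plane(320, 0)

# Angle subtended by the two horizontal extremes on a plane 1 m away.
fov_deg = math.degrees(math.atan(x_right) - math.atan(x_left))

print("center:", center)
print("x at left/right corner: %.2f / %.2f" % (x_left, x_right))
print("horizontal field of view: %.1f degrees" % fov_deg)
```

The corner values come out at about -0.56 and +0.56, and the angle between them is a little under 60 degrees.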

FFT output is blank when using FFTW_MEASURE, but works fine with FFTW_ESTIMATE

I'm having the following issue in my attempt to use fftw3. For some reason, whenever I do an FFT using FFTW_MEASURE instead of FFTW_ESTIMATE, I get blank output. Ultimately I'm trying to implement fft convolution, so my example below includes both the FFT and the inverse FFT.
Clearly I'm missing something... is anyone able to educate me? Thank you!
I'm on Linux (OpenSUSE Leap 42.1), using the version of fftw3 available from my package manager.
Minimum working example:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <fftw3.h>
using namespace std;
int main(int argc, char ** argv)
{
int width = 10;
int height = 8;
cout.setf(ios::fixed|ios::showpoint);
cout << setprecision(2);
double * inp = (double *) fftw_malloc(sizeof(double) * width * height);
fftw_complex * cplx = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * height * (width/2 + 1));
for(int i = 0; i < width * height; i++) inp[i] = sin(i);
fftw_plan fft = fftw_plan_dft_r2c_2d(height, width, inp, cplx, FFTW_MEASURE );
fftw_plan ifft = fftw_plan_dft_c2r_2d(height, width, cplx, inp, FFTW_MEASURE );
fftw_execute(fft);
for(int j = 0; j < height; j++)
{
for(int i = 0; i < (width/2 + 1); i++)
{
cout << cplx[i+width*j][0] << " ";
}
cout << endl;
}
cout << endl << endl;
fftw_execute(ifft);
for(int j = 0; j < height; j++)
{
for(int i = 0; i < width; i++)
{
cout << inp[i+width*j] << " ";
}
cout << endl;
}
fftw_destroy_plan(fft);
fftw_destroy_plan(ifft);
fftw_free(cplx);
fftw_free(inp);
return 0;
}
Just change between FFTW_ESTIMATE and FFTW_MEASURE.
Compiled with:
g++ *.cpp -lm -lfftw3 --std=c++11
Output with FFTW_ESTIMATE (first block is the real part of the FT, second block is after inverse FT):
1.51 2.24 -1.52 -0.05 0.15 0.19
0.23 0.15 1.77 1.19 0.54 0.41
1.97 -0.15 -1.32 -2.51 -1.20 -3.38
4.34 15.21 -24.82 -7.44 -4.16 -2.51
-0.43 -0.06 1.55 2.93 -2.81 -0.42
0.00 0.00 0.00 -nan 0.00 0.00
0.00 0.00 0.00 0.00 0.00 -nan
0.00 0.00 0.00 0.00 0.00 0.00
0.00 67.32 72.74 11.29 -60.54 -76.71 -22.35 52.56 79.15 32.97
-43.52 -80.00 -42.93 33.61 79.25 52.02 -23.03 -76.91 -60.08 11.99
73.04 66.93 -0.71 -67.70 -72.45 -10.59 61.00 76.51 21.67 -53.09
-79.04 -32.32 44.11 79.99 42.33 -34.25 -79.34 -51.48 23.71 77.10
59.61 -12.69 -73.32 -66.54 1.42 68.07 72.14 9.89 -61.46 -76.30
-20.99 53.62 78.93 31.67 -44.70 -79.98 -41.72 34.89 79.43 50.94
-24.38 -77.29 -59.13 13.39 73.60 66.15 -2.12 -68.44 -71.83 -9.18
61.91 76.08 20.31 -54.14 -78.81 -31.02 45.29 79.96 41.12 -35.53
Output with FFTW_MEASURE (first block is the real part of the FT, second block is after inverse FT):
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 -nan 0.00 0.00
0.00 0.00 0.00 0.00 0.00 -nan
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The comment by @Paul_R is sufficient to solve the problem. The input array can be modified when fftw_plan_dft_r2c_2d() is called. Hence, the input array must be initialized after the creation of the FFTW plan.
The documentation of the planner flags of FFTW details what is happening. I am pretty sure that you have already guessed the reason why FFTW_ESTIMATE preserves the input array and FFTW_MEASURE modifies it.
Important: the planner overwrites the input array during planning unless a saved plan (see Wisdom) is available for that problem, so you should initialize your input data after creating the plan. The only exceptions to this are the FFTW_ESTIMATE and FFTW_WISDOM_ONLY flags, as mentioned below.
...
FFTW_ESTIMATE specifies that, instead of actual measurements of different algorithms, a simple heuristic is used to pick a (probably sub-optimal) plan quickly. With this flag, the input/output arrays are not overwritten during planning.
FFTW_MEASURE tells FFTW to find an optimized plan by actually computing several FFTs and measuring their execution time. Depending on your machine, this can take some time (often a few seconds). FFTW_MEASURE is the default planning option.
...
The documentation also tells us that the flag FFTW_ESTIMATE will preserve the input. Yet, the best advice is to initialize the array once the plan is created.
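A separate observation, not part of the answer above: the print loop in the question indexes cplx with a row stride of width, whereas the r2c output holds width/2 + 1 complex values per row. The following Python sketch only does the index arithmetic (no FFTW calls) to show which printed rows would fall outside the allocated array; the helper name printed_indices is made up for this illustration.

```python
width, height = 10, 8           # sizes from the question
row_len = width // 2 + 1        # complex values per row in the r2c output
n_complex = height * row_len    # number of fftw_complex elements allocated

def printed_indices(stride):
    """Indices read by the question's print loop for a given row stride."""
    return [[i + stride * j for i in range(row_len)] for j in range(height)]

# Rows whose indices run past the end of the allocated array.
rows_out_of_bounds_question = [
    j for j, row in enumerate(printed_indices(width)) if max(row) >= n_complex
]
rows_out_of_bounds_fixed = [
    j for j, row in enumerate(printed_indices(row_len)) if max(row) >= n_complex
]

print(rows_out_of_bounds_question)  # [5, 6, 7]
print(rows_out_of_bounds_fixed)     # []
```

This would be consistent with the last three rows of the FFTW_ESTIMATE output showing zeros and -nan. Indexing with cplx[i + (width/2 + 1)*j] keeps every read inside the array.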

High CPU usage with usleep on CentOS 6.3

I compiled the code below on CentOS 5.3 and CentOS 6.3:
#include <pthread.h>
#include <list>
#include <unistd.h>
#include <iostream>
using namespace std;
pthread_mutex_t _mutex;
pthread_spinlock_t spinlock;
list<int *> _task_list;
void * run(void*);
int main()
{
int worker_num = 3;
pthread_t pids[worker_num];
pthread_mutex_init(&_mutex, NULL);
for (int worker_i = 0; worker_i < worker_num; ++worker_i)
{
pthread_create(&(pids[worker_i]), NULL, run, NULL);
}
sleep(14);
}
void *run(void * args)
{
int *recved_info;
long long start;
while (true)
{
pthread_mutex_lock(&_mutex);
if (_task_list.empty())
{
recved_info = 0;
}
else
{
recved_info = _task_list.front();
_task_list.pop_front();
}
pthread_mutex_unlock(&_mutex);
if (recved_info == 0)
{
int f = usleep(1);
continue;
}
}
}
When running on 5.3, you can't even find the process in top; CPU usage is around 0%. But on CentOS 6.3, it's about 20% with 6 threads on a 4-core CPU.
So I checked a.out with time and strace; the results are roughly as follows:
On 5.3:
real 0m14.003s
user 0m0.001s
sys 0m0.001s
On 6.3:
real 0m14.002s
user 0m1.484s
sys 0m1.160s
the strace:
on 5.3:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
91.71 0.002997 0 14965 nanosleep
8.29 0.000271 271 1 execve
0.00 0.000000 0 5 read
0.00 0.000000 0 10 4 open
0.00 0.000000 0 6 close
0.00 0.000000 0 4 4 stat
0.00 0.000000 0 6 fstat
0.00 0.000000 0 22 mmap
0.00 0.000000 0 13 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 3 rt_sigaction
0.00 0.000000 0 3 rt_sigprocmask
0.00 0.000000 0 1 1 access
0.00 0.000000 0 3 clone
0.00 0.000000 0 1 uname
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 38 4 futex
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 4 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.003268 15092 13 total
on 6.3:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.99 1.372813 36 38219 nanosleep
0.01 0.000104 0 409 43 futex
0.00 0.000000 0 5 read
0.00 0.000000 0 6 open
0.00 0.000000 0 6 close
0.00 0.000000 0 6 fstat
0.00 0.000000 0 22 mmap
0.00 0.000000 0 15 mprotect
0.00 0.000000 0 1 munmap
0.00 0.000000 0 3 brk
0.00 0.000000 0 3 rt_sigaction
0.00 0.000000 0 3 rt_sigprocmask
0.00 0.000000 0 7 7 access
0.00 0.000000 0 3 clone
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 4 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 1.372917 38716 50 total
The time and strace results come from different runs, so the data differs a little, but I think it still shows something.
I checked the kernel config options CONFIG_HIGH_RES_TIMERS, CONFIG_HPET and CONFIG_HZ:
On 5.3:
$ cat /boot/config-`uname -r` |grep CONFIG_HIGH_RES_TIMERS
$ cat /boot/config-`uname -r` |grep CONFIG_HPET
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
# CONFIG_HPET_MMAP is not set
$ cat /boot/config-`uname -r` |grep CONFIG_HZ
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
On 6.3:
$ cat /boot/config-`uname -r` |grep CONFIG_HIGH_RES_TIMERS
CONFIG_HIGH_RES_TIMERS=y
$ cat /boot/config-`uname -r` |grep CONFIG_HPET
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
$ cat /boot/config-`uname -r` |grep CONFIG_HZ
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
In fact, I also tried the code on Arch on ARM and on xubuntu13.04-amd64-desktop; both behave the same as CentOS 6.3.
So what can I do to figure out the reason for the different CPU usage?
Does it have anything to do with the kernel config?
You're correct, it has to do with the kernel config. usleep(1) will try to sleep for one microsecond. Before high resolution timers, it was not possible to sleep for less than a jiffy (in your case HZ=1000 so 1 jiffy == 1 millisecond).
On CentOS 5.3, which does not have these high resolution timers, you would sleep between 1ms and 2ms[1]. On CentOS 6.3, which has these timers, you're sleeping for close to one microsecond. That's why you're using more CPU on this platform: you're simply polling your task list 500-1000 times more often.
If you change the code to usleep(1000), CentOS 5.3 will behave the same. CentOS 6.3 CPU time will decrease and be in the same ballpark as the program running on CentOS 5.3.
There is a full discussion of this in the Linux manual: run man 7 time.
Note that your code should use condition variables instead of polling your task list at a certain time interval. That's a more efficient and clean way to do what you're doing.
Also, your main should really join the threads instead of just sleeping for 14 seconds.
[1] There is one exception. If your application was running under a realtime scheduling policy (SCHED_FIFO or SCHED_RR), it would busy-wait instead of sleeping in order to sleep close to the right amount. But by default you need root privileges.
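The condition-variable and join suggestions above can be sketched as follows. This is a Python illustration of the pattern, not a port of the questioner's pthread code; the STOP sentinel, the doubling "work", and the function names are hypothetical.

```python
import threading
from collections import deque

STOP = object()  # hypothetical sentinel telling a worker to exit

def worker(tasks, cond, results):
    while True:
        with cond:
            # wait_for releases the lock and blocks until a task is queued,
            # so an idle worker consumes no CPU instead of polling.
            cond.wait_for(lambda: len(tasks) > 0)
            item = tasks.popleft()
        if item is STOP:
            return
        results.append(item * 2)  # stand-in for real work

def run(num_workers=3, num_tasks=10):
    tasks, cond, results = deque(), threading.Condition(), []
    threads = [threading.Thread(target=worker, args=(tasks, cond, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for n in range(num_tasks):
        with cond:
            tasks.append(n)
            cond.notify()          # wake one waiting worker
    with cond:
        for _ in threads:
            tasks.append(STOP)     # one sentinel per worker
        cond.notify_all()
    for t in threads:
        t.join()                   # wait for workers instead of sleeping
    return sorted(results)

print(run())
```

In the pthread version the same roles would be played by pthread_cond_wait, pthread_cond_signal and pthread_join.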

Strange profiler behavior: same functions, different performances

I was learning to use gprof and then I got weird results for this code:
int one(int a, int b)
{
int i, r = 0;
for (i = 0; i < 1000; i++)
{
r += b / (a + 1);
}
return r;
}
int two(int a, int b)
{
int i, r = 0;
for (i = 0; i < 1000; i++)
{
r += b / (a + 1);
}
return r;
}
int main()
{
for (int i = 1; i < 50000; i++)
{
one(i, i * 2);
two(i, i * 2);
}
return 0;
}
and this is the profiler output
% cumulative self self total
time seconds seconds calls us/call us/call name
50.67 1.14 1.14 49999 22.80 22.80 two(int, int)
49.33 2.25 1.11 49999 22.20 22.20 one(int, int)
If I call one then two, the result is the inverse: two takes more time than one.
Both are the same function, but the first call always takes less time than the second.
Why is that?
Note: The assembly code is exactly the same and code is being compiled with no optimizations
I'd guess it is some fluke in run-time optimisation - one uses a register and the other doesn't or something minor like that.
The system clock probably runs to a precision of 100nsec. The average call time 30nsec or 25nsec is less than one clock tick. A rounding error of 5% of a clock tick is pretty small. Both times are near enough zero.
My guess: it is an artifact of the way mcount data gets interpreted. The granularity for mcount (monitor.h) is on the order of a 32 bit longword - 4 bytes on my system. So you would not expect this: I get different reports from prof vs gprof on the EXACT same mon.out file.
solaris 9 -
prof
%Time Seconds Cumsecs    #Calls  msec/call  Name
 46.4    2.35    2.35  59999998     0.0000  .div
 34.8    1.76    4.11 120000025     0.0000  _mcount
 10.1    0.51    4.62         1   510.      main
  5.3    0.27    4.89  29999999     0.0000  one
  3.4    0.17    5.06  29999999     0.0000  two
  0.0    0.00    5.06         1     0.      _fpsetsticky
  0.0    0.00    5.06         1     0.      _exithandle
  0.0    0.00    5.06         1     0.      _profil
  0.0    0.00    5.06        20     0.0     _private_exit, _exit
  0.0    0.00    5.06         1     0.      exit
  0.0    0.00    5.06         4     0.      atexit
gprof
% cumulative self self total
time seconds seconds calls ms/call ms/call name
71.4 0.90 0.90 1 900.00 900.00 key_2_text <cycle 3> [2]
5.6 0.97 0.07 106889 0.00 0.00 _findbuf [9]
4.8 1.03 0.06 209587 0.00 0.00 _findiop [11]
4.0 1.08 0.05 __do_global_dtors_aux [12]
2.4 1.11 0.03 mem_init [13]
1.6 1.13 0.02 102678 0.00 0.00 _doprnt [3]
1.6 1.15 0.02 one [14]
1.6 1.17 0.02 two [15]
0.8 1.18 0.01 414943 0.00 0.00 realloc <cycle 3> [16]
0.8 1.19 0.01 102680 0.00 0.00 _textdomain_u <cycle 3> [21]
0.8 1.20 0.01 102677 0.00 0.00 get_mem [17]
0.8 1.21 0.01 $1 [18]
0.8 1.22 0.01 $2 [19]
0.8 1.23 0.01 _alloc_profil_buf [22]
0.8 1.24 0.01 _mcount (675)
Is it always the first one called that is slightly slower? If that's the case, I would guess it is a CPU cache doing its thing, or it could be lazy paging by the operating system.
BTW: what optimization flags are you compiling with?
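A rough way to judge whether the 1.14 s versus 1.11 s gap in the question is meaningful is to count samples. The sketch below assumes gprof's usual 0.01 s sampling period and treats each sample as a coin flip between two equally expensive functions; both are simplifying assumptions, not something stated in the question.

```python
import math

sample_period = 0.01          # seconds per sample (assumed gprof default)
t_one, t_two = 1.11, 1.14     # self seconds from the flat profile

n_one = round(t_one / sample_period)   # samples attributed to one()
n_two = round(t_two / sample_period)   # samples attributed to two()
n = n_one + n_two

# Binomial model: each sample lands in either function with p = 0.5.
sigma = math.sqrt(n * 0.5 * 0.5)
deviation = abs(n_two - n / 2)

print("samples:", n_one, n_two)
print("deviation from an even split: %.1f samples" % deviation)
print("one standard deviation:       %.1f samples" % sigma)
```

Under these assumptions the observed split is 1.5 samples away from even, while one standard deviation is 7.5 samples, so a difference of this size is what sampling noise alone would be expected to produce.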

creating matrix with probabilities

I want to generate an NxN matrix to test some code that I have. Each row contains floats and has to add up to 1 (i.e. each row is a set of probabilities).
The tricky part is that some randomly chosen elements should be 0 (in fact, most elements should be 0, with only a few random ones holding the probabilities). The probabilities need to be 1/m, where m is the number of nonzero elements in that row. I thought about ways to output this, but essentially I need it stored in a C++ array, so even if I write it to a file I still would not have it in an array as I need it. In the end I need that array because I want to generate a Matrix Market file. I found a C++ implementation that takes an array and converts it to a Matrix Market file, so that is what I am basing this on. The rest of my code takes this Matrix Market file as input, so it needs to be the primary form of output. The language does not matter; I just want to generate the file at the end (I found mmwrite and mmread in Python as well).
Please help, I am stuck and not really sure how to implement this.
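import random

N = 10
matrix = []
for j in range(N):
    # Each entry is 1 with probability 0.6, otherwise 0.
    t = [int(random.random() < 0.6) for i in range(N)]
    ones = t.count(1)
    # Scale the ones so the row sums to 1; an all-zero row is left as is.
    row = [float(x) / ones for x in t] if ones else t
    matrix.append(row)
for r in matrix:
    print(r)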
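Since the end goal is a Matrix Market file, here is a minimal sketch that builds such rows and renders them as Matrix Market coordinate text using only the standard library. It uses a different sampling scheme from the snippet above (it picks between 1 and max_nonzero columns per row, so no row is all zeros); the function names, the max_nonzero parameter and the seed are hypothetical choices for illustration.

```python
import random

def make_rows(n, max_nonzero, rng):
    """For each row, pick the sorted column indices that will hold 1/m."""
    rows = []
    for _ in range(n):
        m = rng.randint(1, max_nonzero)       # at least one nonzero per row
        rows.append(sorted(rng.sample(range(n), m)))
    return rows

def to_matrix_market(rows, n):
    """Render the matrix as Matrix Market coordinate text (1-based indices)."""
    nnz = sum(len(cols) for cols in rows)
    lines = ["%%MatrixMarket matrix coordinate real general",
             f"{n} {n} {nnz}"]
    for i, cols in enumerate(rows, start=1):
        for j in cols:
            lines.append(f"{i} {j + 1} {1.0 / len(cols):.17g}")
    return "\n".join(lines) + "\n"

rng = random.Random(234)
rows = make_rows(10, 3, rng)
text = to_matrix_market(rows, 10)
print(text)
```

Writing text to a file with a .mtx extension should give something that Matrix Market readers can load; storing only the nonzero entries also suits a matrix that is mostly zeros.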
By C++ array, do you mean a C array or an STL vector<vector< > >? The latter would be cleaner, but here's an example using C arrays:
#include <stdlib.h>
#include <stdio.h>

/* Build an N x N row-major matrix in which each entry is zero with
   probability zeroProbability and every nonzero entry in a row is 1/m,
   where m is the number of nonzero entries in that row. */
float* makeProbabilityMatrix(int N, float zeroProbability)
{
    float* matrix = (float*)malloc(N*N*sizeof(float));
    for (int ii = 0; ii < N; ii++)
    {
        int m = 0;
        for (int jj = 0; jj < N; jj++)
        {
            int val = (rand() / (RAND_MAX*1.0) < zeroProbability) ? 0 : 1;
            matrix[ii*N+jj] = val;
            m += val;
        }
        /* Skip the division for an all-zero row to avoid 0/0 (NaN). */
        for (int jj = 0; m > 0 && jj < N; jj++)
        {
            matrix[ii*N+jj] /= m;
        }
    }
    return matrix;
}

int main()
{
    srand(234);
    int N = 10;
    float* matrix = makeProbabilityMatrix(N, 0.70);
    for (int ii = 0; ii < N; ii++)
    {
        for (int jj = 0; jj < N; jj++)
        {
            printf("%.2f ", matrix[ii*N+jj]);
        }
        printf("\n");
    }
    free(matrix);
    return 0;
}
Output:
0.00 0.20 0.20 0.00 0.00 0.00 0.00 0.20 0.20 0.20
0.25 0.00 0.00 0.00 0.00 0.25 0.00 0.25 0.25 0.00
0.00 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.50 0.00 0.00 0.50 0.00
0.25 0.25 0.00 0.00 0.00 0.00 0.25 0.00 0.25 0.00
0.00 0.25 0.00 0.00 0.00 0.25 0.25 0.00 0.25 0.00
0.00 0.00 0.33 0.00 0.33 0.00 0.00 0.00 0.33 0.00
0.00 0.20 0.20 0.20 0.20 0.00 0.00 0.20 0.00 0.00
0.20 0.00 0.20 0.00 0.00 0.00 0.00 0.20 0.20 0.20
0.00 0.00 0.00 0.00 0.00 0.50 0.00 0.50 0.00 0.00