Parallelization limit of OpenMP in DGEMM - Fortran

For the following code, extended from OpenMP with BLAS:
Program bench_dgemm
  Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
  Use :: omp_lib
  integer, parameter :: dp = selected_real_kind(15, 307)
  Real( dp ), Dimension( :, : ), Allocatable :: a
  Real( dp ), Dimension( :, :, : ), Allocatable :: b
  Real( dp ), Dimension( :, :, : ), Allocatable :: c
  Integer :: na, nb, nc, nd, m, m_iter
  Integer( li ) :: start, finish, rate
  Integer :: numthreads
  Integer :: ithr, istart, iend
  real(dp) :: sum_time
  Write( *, * ) 'numthreads'
  Read( *, * ) numthreads
  call omp_set_num_threads(numthreads)
  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  Allocate( a ( 1:na, 1:nb ) )
  Allocate( b ( 1:nb, 1:nc, 1:nd ) )
  Allocate( c( 1:na, 1:nc, 1:nd ) )
  !A[a,b] * B[b,c,d] = C[a,c,d]
  Call Random_number( a )
  Call Random_number( b )
  c = 0.0_dp
  m_iter = 30
  write (*,*) 'm_iter average', m_iter
  write (*,*) 'numthreads', numthreads
  sum_time = 0.0
  do m = 1, m_iter
     Call System_clock( start, rate )
     !$omp parallel private(ithr, istart, iend)
     ithr = omp_get_thread_num()
     istart = ithr * nd / numthreads
     iend = (ithr + 1) * nd / numthreads
     Call dgemm('N', 'N', na, nc * (iend - istart), nb, 1.0_dp, a, na, &
                b(1, 1, 1 + istart), Size(b, Dim = 1), &
                0.0_dp, c(1, 1, 1 + istart), Size(c, Dim = 1))
     !$omp end parallel
     Call System_clock( finish, rate )
     sum_time = sum_time + Real( finish - start, dp ) / rate
  end do
  Write( *, * ) 'Time for dgemm', sum_time / m_iter
End
Suppose the file is called bench.f90. I compiled it with ifort bench.f90 -o bench -qopenmp -mkl=sequential and then ran bench.
For na=nb=nc=nd=200, varying numthreads gives me (first column is numthreads):
1 Time for dgemm 4.053670000000001E-002
2 Time for dgemm 2.087716666666666E-002
4 Time for dgemm 1.082136666666667E-002
8 Time for dgemm 5.819133333333333E-003
16 Time for dgemm 4.304533333333333E-003
32 Time for dgemm 5.269366666666666E-003
I tried gfortran bench.f90 -o bench -fopenmp -lopenblas and got
1 Time for dgemm 0.13534268956666665
2 Time for dgemm 6.9672616866666662E-002
4 Time for dgemm 3.5927094433333334E-002
8 Time for dgemm 1.8583297666666668E-002
16 Time for dgemm 1.1969903900000000E-002
32 Time for dgemm 1.9136184166666667E-002
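For reference, the implied speed-ups at 16 threads are roughly 4.05E-2 / 4.30E-3 ≈ 9.4x (ifort/MKL) and 0.135 / 1.20E-2 ≈ 11.3x (gfortran/OpenBLAS); at 32 threads they fall back to about 7.7x and 7.1x respectively.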
It seems OpenMP gets less speed-up at 32 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 sockets, so 40 cores in total). The split of indices is over the outermost dimension: similar to A[a,b]*B[b,c], the code splits the last index into segments, which should be straightforward to parallelize. So why does the performance stop improving well before ~32 cores? (If the split dimension d were only 30, I could imagine 32 cores not helping, but here it is 200.)
Does MPI give better performance than OpenMP here, relative to the ideal scaling?
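For reference, the split in the code above is equivalent to an explicit OpenMP loop over the d index with one dgemm call per slice. The sketch below assumes an extra integer :: id declared with the other locals; it only illustrates the decomposition, it is not claimed to perform differently:
!$omp parallel do
do id = 1, nd
   ! one (na x nb) * (nb x nc) product per d-slice; lda = na, ldb = nb, ldc = na
   call dgemm('N', 'N', na, nc, nb, 1.0_dp, a, na, &
              b(1, 1, id), nb, 0.0_dp, c(1, 1, id), na)
end do
!$omp end parallel do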

We tried the shared sample code at our end and saw the results below.
Try "setenv OMP_PROC_BIND true" (or export the equivalent variable in your shell), as it should help in your case.
numthreads
1
na, nb, nc, nd ?
200
200
200
200
m_iter average 30
numthreads 1
MKL_VERBOSE oneMKL 2022.0 Product build 20211112 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 sequential
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.23ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.50ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.66ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.68ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.64ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.63ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.67ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.71ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.74ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.68ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.65ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.71ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.68ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 113.67ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 116.28ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 143.58ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.96ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.98ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.06ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.99ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.12ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.06ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.01ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 105.93ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.08ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.07ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.09ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.10ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.03ms CNR:OFF Dyn:1 FastMM:1
MKL_VERBOSE DGEMM(N,N,200,40000,200,0x4b1490,0x1488e659b240,200,0x1488d7cfa280,200,0x4b1498,0x1488d3cf92c0,200) 106.05ms CNR:OFF Dyn:1 FastMM:1
Time for dgemm 0.116057933333333

Related

How to pause Execution in SAS for milliseconds

How can I pause execution for 5 milliseconds in SAS?
Can I use "CALL SLEEP(0.005)"?
I have checked the link below, but it's confusing:
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/n12ppys43orawkn1q0oxep4cmdk6.htm
The following will pause execution for 5 milliseconds in SAS:
data _null_;
call sleep(5);
run;
You can use the optional unit argument to specify the unit of time, in seconds, that is applied to n. The default is 0.001 (milliseconds). You can change it to seconds if you like:
data _null_;
call sleep(0.005,1);
run;
This call is equivalent to the one above.
Sometimes it is better to try it out than to google:
data try_to_get_some_sleep;
  format n unit z5.3 evening morning expect measure time18.9 diff percent7.2;
  do unit = 5, 1, .5, .1, .01, .001;
    do n = .1, .9, 1.1, 1.9, 2;
      expect = n * unit;
      evening = time();
      call sleep(n, unit);
      morning = time();
      measure = morning - evening;
      diff = (measure - expect) / expect;
      output;
    end;
  end;
run;
results in
n unit evening morning expect measure diff
0.100 5.000 10:45:05.983414888 10:45:06.483422041 0:00:00.500000000 0:00:00.500007153 0.00%
0.900 5.000 10:45:06.483437061 10:45:10.984707117 0:00:04.500000000 0:00:04.501270056 0.03%
1.100 5.000 10:45:10.984720945 10:45:16.485454082 0:00:05.500000000 0:00:05.500733137 0.01%
1.900 5.000 10:45:16.485466003 10:45:25.984838009 0:00:09.500000000 0:00:09.499372005 (0.01%)
2.000 5.000 10:45:25.984853983 10:45:35.988686085 0:00:10.000000000 0:00:10.003832102 0.04%
0.100 1.000 10:45:35.988715887 10:45:36.088612080 0:00:00.100000000 0:00:00.099896193 (0.10%)
0.900 1.000 10:45:36.088624954 10:45:36.988639116 0:00:00.900000000 0:00:00.900014162 0.00%
1.100 1.000 10:45:36.988765001 10:45:38.089132071 0:00:01.100000000 0:00:01.100367069 0.03%
1.900 1.000 10:45:38.089145899 10:45:39.989645004 0:00:01.900000000 0:00:01.900499105 0.03%
2.000 1.000 10:45:39.989659071 10:45:41.989659071 0:00:02.000000000 0:00:02.000000000 0.00%
0.100 0.500 10:45:41.989671946 10:45:42.038803101 0:00:00.050000000 0:00:00.049131155 (1.74%)
0.900 0.500 10:45:42.038815975 10:45:42.488348961 0:00:00.450000000 0:00:00.449532986 (0.10%)
1.100 0.500 10:45:42.488362074 10:45:43.038013935 0:00:00.550000000 0:00:00.549651861 (0.06%)
1.900 0.500 10:45:43.038027048 10:45:43.987673044 0:00:00.950000000 0:00:00.949645996 (0.04%)
2.000 0.500 10:45:43.987685919 10:45:44.987751007 0:00:01.000000000 0:00:01.000065088 0.01%
0.100 0.100 10:45:44.987765074 10:45:44.996871948 0:00:00.010000000 0:00:00.009106874 (8.93%)
0.900 0.100 10:45:44.996876955 10:45:45.085994005 0:00:00.090000000 0:00:00.089117050 (0.98%)
1.100 0.100 10:45:45.086005926 10:45:45.195319891 0:00:00.110000000 0:00:00.109313965 (0.62%)
1.900 0.100 10:45:45.195332050 10:45:45.384675980 0:00:00.190000000 0:00:00.189343929 (0.35%)
2.000 0.100 10:45:45.384690046 10:45:45.585688114 0:00:00.200000000 0:00:00.200998068 0.50%
0.100 0.010 10:45:45.585701942 10:45:45.585707903 0:00:00.001000000 0:00:00.000005960 (99.4%)
0.900 0.010 10:45:45.585709095 10:45:45.595653057 0:00:00.009000000 0:00:00.009943962 10.5%
1.100 0.010 10:45:45.595659971 10:45:45.607652903 0:00:00.011000000 0:00:00.011992931 9.03%
1.900 0.010 10:45:45.607661009 10:45:45.626678944 0:00:00.019000000 0:00:00.019017935 0.09%
2.000 0.010 10:45:45.626689911 10:45:45.646678925 0:00:00.020000000 0:00:00.019989014 (0.05%)
0.100 0.001 10:45:45.646688938 10:45:45.646688938 0:00:00.000100000 0:00:00.000000000 ( 100%)
0.900 0.001 10:45:45.646689892 10:45:45.646689892 0:00:00.000900000 0:00:00.000000000 ( 100%)
1.100 0.001 10:45:45.646691084 10:45:45.647506952 0:00:00.001100000 0:00:00.000815868 (25.8%)
1.900 0.001 10:45:45.647507906 10:45:45.647620916 0:00:00.001900000 0:00:00.000113010 (94.1%)
2.000 0.001 10:45:45.647623062 10:45:45.650509119 0:00:00.002000000 0:00:00.002886057 44.3%
You can directly suspend a SAS session between steps using %SYSCALL SLEEP(... or %SYSFUNC(SLEEP(... NOTE: When using %SYSCALL you need to pass macro variables as the arguments, not numeric literal text.
Example:
%let duration = 5;
%let unit = 0.001;
data one; set sashelp.class; run;
%syscall sleep(duration,unit);
data two; set sashelp.cars; run;
or
data one; set sashelp.class; run;
%let rc = %sysfunc ( sleep ( 5, 0.001 ));
data two; set sashelp.cars; run;

MPI_Scatterv in Fortran crashes with Abort trap signal

I have written a master-slave I/O routine in Fortran. First, process 0 reads the file and puts the data in the array read_buffer, and then I call the subroutine scatter_data.
I have created some communicators to scatter data from process 0 to processes 0-8.
These communicators look like this:
2, 5, 8 : sub_io_communicator (in this sub_io_communicator, sub_iorank is /0, 1, 2/)
1, 4, 7 : sub_io_communicator (in this sub_io_communicator, sub_iorank is /0, 1, 2/)
0, 3, 6 : sub_io_communicator (in this sub_io_communicator, sub_iorank is /0, 1, 2/)
2, 1, 0 : master_communicator
DATA
||
0 process read
||
0 call MPI_scatterv in communicator "master_communicator"
/ | \
0 1 2 call MPI_scatterv in communicator "sub_io_communicator"
/ | \ / | \ / | \
0 3 6 1 4 7 2 5 8
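(For context, communicators with this layout could be produced with MPI_Comm_split; the snippet below is only a hypothetical reconstruction, since the actual creation code is not shown, and the colour variable and colour/key choice are assumptions.)
! hypothetical reconstruction of the layout above, assuming 9 world ranks and integer :: colour
call MPI_Comm_split(MPI_COMM_WORLD, mod(my_rank, 3), my_rank / 3, sub_io_communicator, ierr)
call MPI_Comm_rank(sub_io_communicator, sub_iorank, ierr)
colour = MPI_UNDEFINED
if (sub_iorank == 0) colour = 0          ! world ranks 0, 1, 2 form the master communicator
call MPI_Comm_split(MPI_COMM_WORLD, colour, my_rank, master_communicator, ierr)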
But when I call MPI_Scatterv, it crashes. I used print statements to debug it and found that the failure is in the call to MPI_Scatterv. So I wrote a very simple MPI_Scatterv in this subroutine to see whether it would work, but it does not.
My code is like this:
SUBROUTINE scatter_data(read_buffer, ne_in)
  use naqpms_nest, only : nest, nxlo, nylo, ratio, nx, ny
  implicit none
  include 'mpif.h'
  integer, INTENT(IN) :: ne_in
  integer :: ierr
  integer :: location
  integer :: ii, jj, kk, zz, dd, send_size, receive_size
  real, INTENT(IN), dimension(nx(ne_in)*ny(ne_in)) :: read_buffer
  integer, dimension(nx(ne_in)*ny(ne_in)) :: read_buffer_int
  real, allocatable :: rerange_buffer(:)
  integer, allocatable :: receive_buffer_int(:)
  integer, allocatable :: counts_recv(:), displacements(:)
  integer :: distance, left_bdy, left_bdy2, rigth_bdy, rigth_bdy2
  integer :: tmp1(3), tmp2(3)

  IF( sub_iorank.EQ.0 ) THEN
    if(allocated(counts_recv))    deallocate(counts_recv)
    if(allocated(displacements))  deallocate(displacements)
    if(allocated(receive_buffer)) deallocate(receive_buffer)
    if(allocated(rerange_buffer)) deallocate(rerange_buffer)
    IF( master_iorank.EQ.0 ) THEN
      allocate(counts_recv(dims(2,ne_in)))
      allocate(displacements(dims(2,ne_in)))
    ENDIF
    print*,"location",my_rank,nx(ne_in), ey(ne_in), sy(ne_in)
    receive_size = nx(ne_in) * (ey(ne_in)-sy(ne_in)+1)
    allocate(receive_buffer(receive_size))
    allocate(receive_buffer_int(receive_size))
    allocate(rerange_buffer(receive_size))
    CALL MPI_Gather(receive_size, 1, MPI_INTEGER, counts_recv, 1, MPI_INTEGER,&
                    0, master_communicator, ierr )
    tmp1(1:3) = (/1,1,1/)
    tmp2(1:3) = (/0,1,2/)
    IF(my_rank==0) print*,"counts_recv",counts_recv
    CALL mpi_scatterv (counts_recv, tmp1, tmp2, MPI_INT,& ! this is just for test
                       receive_size, 1, mpi_int, 0, master_communicator, ierr )
    print*,my_rank,receive_size
    IF(my_rank.EQ.0) THEN
      displacements(1)=0
      do ii=2, dims(2,ne_in)
        displacements(ii) = displacements(ii-1) + counts_recv(ii-1)
      enddo
    ENDIF
    IF(my_rank==0) print*,displacements,counts_recv
    CALL mpi_scatterv (read_buffer, counts_recv, displacements, mpi_real,&
                       receive_buffer, receive_size, mpi_real, 0, master_communicator, ierr )
    IF(my_rank==0) print*,"mpi_scatterv one ok",my_rank
  ENDIF ! sub_iorank = 0

  IF(sub_iorank .EQ. 0) THEN
    if(allocated(counts_recv))   deallocate(counts_recv)
    if(allocated(displacements)) deallocate(displacements)
    allocate(counts_recv(dims(1,ne_in)))
    allocate(displacements(dims(1,ne_in)))
  ENDIF
  receive_size = (ex(ne_in)-sx(ne_in)+1)*(ey(ne_in)-sy(ne_in)+1)
  CALL MPI_Gather(receive_size, 1, MPI_INTEGER, counts_recv, 1, MPI_INTEGER, 0, sub_io_communicator, ierr )
  IF(sub_iorank .EQ. 0) THEN
    displacements(1)=0
    do ii=2,dims(1,ne_in)
      displacements(ii) = displacements(ii-1) + counts_recv(ii-1)
    enddo
  ENDIF
  IF(sub_iorank .EQ. 0) THEN
    DO dd = 1, dims(1,ne_in)
      DO jj = 1, bdy_gather(4,dd,ne_in)-bdy_gather(3,dd,ne_in)+1
        distance   = bdy_gather(2,dd,ne_in)-bdy_gather(1,dd,ne_in)+1
        left_bdy   = (jj-1)*nx(ne_in) + bdy_gather(1,dd,ne_in)
        rigth_bdy  = (jj-1)*nx(ne_in) + bdy_gather(2,dd,ne_in)
        left_bdy2  = displacements(dd) + (jj-1)*distance + 1
        rigth_bdy2 = displacements(dd) + (jj-1)*distance + distance
        rerange_buffer( left_bdy2 : rigth_bdy2 ) = receive_buffer( left_bdy : rigth_bdy )
      ENDDO
    ENDDO
  ENDIF
  if(allocated(receive_buffer)) deallocate(receive_buffer)
  allocate(receive_buffer(receive_size))
  IF(sub_iorank .EQ. 0) print*, my_rank, counts_recv, displacements
  CALL mpi_scatterv( rerange_buffer, counts_recv, displacements, mpi_real,&
                     receive_buffer, receive_size, mpi_real, 0, sub_io_communicator, ierr)
  IF(sub_iorank .EQ. 0) print*,"mpi_scatterv ok"
END SUBROUTINE scatter_data
I run the code with mpirun -np 9 ./gnaqpms.v1.6.0_jx0307.exe
The error in the log file then looks like this:
location 0 88 26 1
location 1 88 52 27
location 2 88 77 53
counts_recv 2288 2288 2200
1 2288
2 2200
*** Error in forrtl: error (76): Abort trap signal
Image PC Routine Line Source
gnaqpms.v1.6.0_jx 00000000007A5F3A Unknown Unknown Unknown
libpthread-2.17.s 00002BA00DADA5D0 Unknown Unknown Unknown
libc-2.17.so 00002BA00E01F207 gsignal Unknown Unknown
libc-2.17.so 00002BA00E0208F8 abort Unknown Unknown
libc-2.17.so 00002BA00E061D27 Unknown Unknown Unknown
libc-2.17.so 00002BA00E06A489 Unknown Unknown Unknown
libmpi.so.12.0 00002BA00CAC2AED Unknown Unknown Unknown
libmpi.so.12.0 00002BA00CAC4A54 Unknown Unknown Unknown
libmpi.so.12 00002BA00CAC3188 MPI_Scatterv Unknown Unknown
libmpifort.so.12. 00002BA00D445A7A mpi_scatterv Unknown Unknown
gnaqpms.v1.6.0_jx 0000000000475EA4 naqpms_parallel_m 1269 naqpms_parallel.f90
gnaqpms.v1.6.0_jx 00000000005953BF rd_met_pyramid_ 151 rd_met_pyramid.f90
gnaqpms.v1.6.0_jx 0000000000617709 read_data_ 61 naqpms_readdata.f90
gnaqpms.v1.6.0_jx 0000000000647BA4 naqpms_calc_mp_ca 141 naqpms_calc.f90
gnaqpms.v1.6.0_jx 000000000065A835 MAIN__ 86 main.f90
gnaqpms.v1.6.0_jx 000000000040B45E Unknown Unknown Unknown
libc-2.17.so 00002BA00E00B3D5 __libc_start_main Unknown Unknown
gnaqpms.v1.6.0_jx 000000000040B369 Unknown Unknown Unknown
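For reference, here is a minimal self-contained MPI_Scatterv sketch. It uses MPI_COMM_WORLD and made-up counts rather than the communicators above, and it relies on the fact that the counts, displacements and send buffer only need to be meaningful on the root rank (non-root ranks can pass zero-sized arrays):
program scatterv_min
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i, mycount
  integer, allocatable :: counts(:), displs(:)
  real,    allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  mycount = rank + 1                            ! every rank receives a different amount
  allocate(recvbuf(mycount))

  if (rank == 0) then
     ! counts, displs and sendbuf are only read on the root rank
     allocate(counts(nprocs), displs(nprocs))
     counts = [(i, i = 1, nprocs)]
     displs(1) = 0
     do i = 2, nprocs
        displs(i) = displs(i-1) + counts(i-1)
     end do
     allocate(sendbuf(sum(counts)))
     call random_number(sendbuf)
  else
     allocate(counts(0), displs(0), sendbuf(0)) ! ignored on non-root ranks
  end if

  call MPI_Scatterv(sendbuf, counts, displs, MPI_REAL, &
                    recvbuf, mycount, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

  print *, 'rank', rank, 'received', mycount, 'values'
  call MPI_Finalize(ierr)
end program scatterv_min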

Linux user process spending 99% of time in kernel thread, no syscalls, still running but slowly

I have a process that is getting "stuck" in a loop of pure user code. Both htop and time show the process is spending most of its time in the kernel, but it isn't making any syscalls.
This behavior only occurs on about half of the invocations. When it happens, the loop takes several minutes to run. When it does not happen, it takes less than a second.
strace shows no differences between a good and bad invocation, and only a few system calls are made by the allocator.
The code in question is a photogrammetry suite called AliceVision. The exact loop is here on github: https://github.com/alicevision/AliceVision/blob/develop/src/aliceVision/track/TracksBuilder.cpp#L89
for (const IndexedFeaturePair& featPair : allFeatures)
{
    lemon::ListDigraph::Node node = _d->graph.addNode();
    map_indexToNode.insert(std::make_pair(featPair, node));
    _d->map_nodeToIndex.insert(std::make_pair(node, featPair));
}
This code is just building some STL and Boost containers, and none of the called methods do any locking. allFeatures has about 9 million entries.
I've gone through all the usual debugging techniques: gdb call stacks look fine (there are two other threads from libgomp, but they are just sitting in a FUTEX_WAIT syscall), and perf shows the same results regardless of whether it gets stuck in the kernel or not.
My remaining hunch is weird scheduler behaviour. The machine is a 16-core/32-thread Ryzen Threadripper 1950X with 48 GB RAM running Ubuntu 20.04, kernel 5.4.0-generic.
Playing with nice values seems to have no effect.
Ideas? Thank you.
uname
uname -a
Linux snorlax 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
lscpu
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1988.628
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.43
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
strace output
mmap(NULL, 298352640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff476378000
mmap(NULL, 298352640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4646f0000
write(2, "[09:14:30.993443][debug] all fea"..., 44) = 44
write(2, "\n", 1) = 1
brk(0x555b0d02d000) = 0x555b0d02d000
brk(0x555b0d05d000) = 0x555b0d05d000
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a61e4000
brk(0x555b0d01e000) = 0x555b0d01e000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a6163000
munmap(0x7ff4a61e4000, 266240) = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a6062000
munmap(0x7ff4a6163000, 528384) = 0
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a5e61000
munmap(0x7ff4a6062000, 1052672) = 0
mmap(NULL, 4198400, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a5a60000
munmap(0x7ff4a5e61000, 2101248) = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a525f000
munmap(0x7ff4a5a60000, 4198400) = 0
mmap(NULL, 16781312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4a425e000
munmap(0x7ff4a525f000, 8392704) = 0
mmap(NULL, 33558528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff48dfff000
munmap(0x7ff4a425e000, 16781312) = 0
mmap(NULL, 67112960, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4606ef000
munmap(0x7ff48dfff000, 33558528) = 0
mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4586ee000
munmap(0x7ff4606ef000, 67112960) = 0
mmap(NULL, 268439552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff4486ed000
munmap(0x7ff4586ee000, 134221824) = 0
# runs for several minutes
htop (screenshot not reproduced here)

Why only one core is reported on Ubuntu VirtualBox(found when using std::async)?

I was playing with std::async() on Ubuntu in VirtualBox. For some reason, with std::async() launching two tasks I was still not getting CPU usage above 100%, even though I passed the std::launch::async flag when launching the tasks.
I was expecting more than 100% usage as I have a six-core AMD Ryzen 5. Below are the code, the compile and run commands, and the CPU info Ubuntu reports:
#include <iostream>
#include <random>
#include <set>
#include <algorithm>
#include <future>
#include <cstdint>   // for uint32_t

std::set<int> unique_random(uint32_t numElem)
{
    std::set<int> retVal;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dist(0, numElem - 1);
    std::generate_n(std::inserter(retVal, retVal.end()), numElem, [&]() { return dist(gen); });
    return retVal;
}

int main()
{
    std::cout << "unique random numbers in first run : "
              << std::async(std::launch::async, unique_random, 1000000).get().size() << '\n'
              << "unique random numbers in second run : "
              << std::async(std::launch::async, unique_random, 1000000).get().size() << '\n';
    return 0;
}
Compilation: g++ async_test.cpp -pthread -o3
Time measurement: /usr/bin/time ./a.out
#cat /proc/cpuinfo
2552ms  Fri 29 May 2020 02:15:08 AM UTC
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen 5 2600X Six-Core Processor
stepping : 2
microcode : 0x6000626
cpu MHz : 3593.248
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
***cpu cores : 1***
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm cr8_legacy abm sse4a misalignsse 3dnowprefetch ssbd vmmcall fsgsbase avx2 rdseed clflushopt arat
bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips : 7186.49
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
As you can see, it reports only 1 core, so now I know why I am not getting CPU usage above 100%. But why is Ubuntu showing only 1 core?
Is it some bug with VirtualBox or am I missing something here?
Thanks

Reading float type numbers from a .txt file and using them in formulas

I am trying to read numbers from a .txt file, use them in formulas within my code, and output the results into another .txt file so they are easier to work with. My problem is reading the 3000 lines of numbers and assigning them to the variables used in the formulas. I want the first column to go into the variable dt and the second column into the variable i. For whatever reason, I can't get them to read correctly. Here is my code.
#include <stdlib.h>
#include <stdio.h>

void GetAcc(const float Acc[], const float Vel[], float Pos[], float dt);
void GetVel(const float Acc[], float Vel[], float dt);

int main()
{
    float Position[3000] = {0};
    float Velocity[3000] = {0};
    float Acceleration[3000] = {0};
    float i;
    float dt;
    FILE *fp;

    fp = fopen("M3 Array File.txt", "r+");
    fscanf(fp, "%f", &dt);
    fscanf(fp, "%f", &i);

    GetVel(Acceleration, Velocity, dt);
    GetAcc(Acceleration, Velocity, Position, dt);

    for (i = 0; i < 3000; i++)
        fprintf(fp, "%3.3f\t%3.3f\t%3.3f\t%3.3f\n", dt*(float)i, Acceleration[i], Velocity[i], Position[i]);

    fclose(fp);
    return 0;
}

void GetVel(const float Acc[], float Vel[], float dt)
{
    int i;
    for (i = 1; i < 3000; i++)
    {
        Vel[i] = Vel[i-1] + Acc[i]*dt;
    }
}

void GetAcc(const float Acc[], const float Vel[], float Pos[], float dt)
{
    int i;
    for (i = 1; i < 3000; i++)
    {
        Acc[i] = 2*Pos[i] - Pos[i-1] - Vel[i-1];
    }
}
Here is an example of what the file looks like.
0.000 0.000
0.001 0.000
0.002 0.000
0.003 0.000
0.004 0.000
0.005 0.000
0.006 0.000
0.007 0.000
0.008 0.000
0.009 0.000
0.010 0.000
0.011 0.000
0.012 0.000
0.013 0.000
0.014 0.000
0.015 0.000
0.016 0.000
0.017 0.000
0.018 0.000
0.019 0.000
0.020 0.000
0.021 0.000
0.022 0.000
0.023 0.000
0.024 0.000
0.025 0.000
0.026 0.000
0.027 0.000
0.028 0.000
0.029 0.000
0.030 0.002
0.031 0.003
0.032 0.005
0.033 0.005
0.034 0.005
0.035 0.005
0.036 0.005
0.037 0.006
0.038 0.008
0.039 0.009
0.040 0.011
0.041 0.012
0.042 0.012
0.043 0.012
0.044 0.012
0.045 0.014
0.046 0.015
0.047 0.017
0.048 0.018
0.049 0.020
0.050 0.021
0.051 0.023
0.052 0.025
0.053 0.026
0.054 0.028
0.055 0.029
0.056 0.031
0.057 0.032
0.058 0.034
0.059 0.035
0.060 0.037
0.061 0.040
0.062 0.043
0.063 0.046
0.064 0.049
0.065 0.052
0.066 0.054
0.067 0.055
0.068 0.057
0.069 0.060
0.070 0.063
0.071 0.066
0.072 0.069
0.073 0.072
0.074 0.075
0.075 0.078
0.076 0.081
0.077 0.084
0.078 0.087
0.079 0.091
0.080 0.095
0.081 0.100
0.082 0.104
0.083 0.109
0.084 0.114
0.085 0.117
0.086 0.120
0.087 0.123
0.088 0.127
0.089 0.132
0.090 0.137
0.091 0.141
0.092 0.146
0.093 0.150
0.094 0.155
0.095 0.161
0.096 0.167
0.097 0.173
0.098 0.179
0.099 0.184
0.100 0.189
0.101 0.193
0.102 0.199
0.103 0.206
0.104 0.212
0.105 0.218
0.106 0.224
0.107 0.230
0.108 0.236
0.109 0.242
0.110 0.250
0.111 0.258
0.112 0.265
0.113 0.273
0.114 0.281
0.115 0.288
0.116 0.296
0.117 0.304
0.118 0.311
0.119 0.319
0.120 0.327
0.121 0.334
0.122 0.342
0.123 0.350
0.124 0.359
0.125 0.368
0.126 0.377
0.127 0.387
0.128 0.396
0.129 0.405
0.130 0.414
0.131 0.423
0.132 0.433
0.133 0.442
0.134 0.451
0.135 0.460
0.136 0.471
0.137 0.482
0.138 0.492
0.139 0.503
0.140 0.514
0.141 0.525
0.142 0.535
0.143 0.546
0.144 0.557
0.145 0.569
0.146 0.581
0.147 0.594
0.148 0.606
0.149 0.618
0.150 0.630
0.151 0.643
0.152 0.655
0.153 0.667
0.154 0.680
0.155 0.692
0.156 0.706
0.157 0.719
0.158 0.733
0.159 0.747
0.160 0.761
0.161 0.775
0.162 0.788
0.163 0.802
0.164 0.816
0.165 0.830
0.166 0.845
0.167 0.861
0.168 0.876
0.169 0.891
0.170 0.907
0.171 0.922
0.172 0.937
0.173 0.953
0.174 0.969
0.175 0.986
0.176 1.003
0.177 1.020
0.178 1.037
0.179 1.054
0.180 1.071
0.181 1.088
0.182 1.104
0.183 1.121
0.184 1.140
0.185 1.158
0.186 1.177
0.187 1.195
0.188 1.213
0.189 1.232
0.190 1.250
0.191 1.269
0.192 1.287
0.193 1.307
0.194 1.327
0.195 1.347
0.196 1.367
0.197 1.387
0.198 1.407
0.199 1.427
0.200 1.447
0.201 1.468
0.202 1.489
0.203 1.511
0.204 1.532
0.205 1.554
0.206 1.575
0.207 1.597
0.208 1.618
0.209 1.640
0.210 1.663
0.211 1.686
0.212 1.709
0.213 1.732
0.214 1.755
0.215 1.778
0.216 1.801
0.217 1.824
0.218 1.848
0.219 1.873
0.220 1.898
0.221 1.922
0.222 1.947
0.223 1.971
0.224 1.996
0.225 2.020
0.226 2.045
0.227 2.071
0.228 2.097
0.229 2.123
0.230 2.149
0.231 2.175
0.232 2.201
0.233 2.227
0.234 2.255
0.235 2.283
0.236 2.310
0.237 2.338
0.238 2.365
0.239 2.393
0.240 2.421
0.241 2.448
0.242 2.476
0.243 2.505
0.244 2.534
0.245 2.563
0.246 2.592
0.247 2.622
0.248 2.651
0.249 2.680
0.250 2.709
0.251 2.740
0.252 2.770
0.253 2.801
0.254 2.832
0.255 2.862
0.256 2.893
0.257 2.924
0.258 2.954
0.259 2.987
0.260 3.019
0.261 3.051
0.262 3.083
0.263 3.116
0.264 3.148
0.265 3.180
0.266 3.212
0.267 3.246
0.268 3.280
0.269 3.313
0.270 3.347
0.271 3.381
0.272 3.415
0.273 3.448
0.274 3.482
0.275 3.517
0.276 3.553
0.277 3.588
0.278 3.623
0.279 3.659
0.280 3.694
0.281 3.729
0.282 3.764
0.283 3.800
0.284 3.836
0.285 3.873
0.286 3.910
0.287 3.947
0.288 3.984
0.289 4.021
0.290 4.057
0.291 4.094
0.292 4.133
0.293 4.171
0.294 4.209
0.295 4.248
0.296 4.286
0.297 4.324
0.298 4.363
0.299 4.401
0.300 4.439
0.301 4.479
0.302 4.519
0.303 4.559
0.304 4.599
0.305 4.639
0.306 4.679
0.307 4.719
0.308 4.758
0.309 4.800
0.310 4.841
0.311 4.883
0.312 4.924
0.313 4.965
0.314 5.007
0.315 5.048
0.316 5.090
0.317 5.131
0.318 5.174
0.319 5.217
0.320 5.260
0.321 5.303
0.322 5.346
0.323 5.389
0.324 5.432
0.325 5.475
0.326 5.518
0.327 5.562
0.328 5.607
0.329 5.651
0.330 5.696
0.331 5.740
0.332 5.785
0.333 5.829
0.334 5.874
0.335 5.918
0.336 5.963
0.337 6.009
0.338 6.055
0.339 6.101
0.340 6.147
0.341 6.193
0.342 6.239
0.343 6.285
0.344 6.331
0.345 6.378
0.346 6.426
0.347 6.473
You can tokenize the line in C++, splitting the string into two parts at the blank space:
#include <iostream>
#include <vector>
#include <string>
#include <sstream>
#include <iterator>
using namespace std;

string line;                      // one line of the file, e.g. "0.032 0.005"
istringstream buf(line);
istream_iterator<string> beg(buf), end;
vector<string> tokens(beg, end);  // tokens[0] = first column, tokens[1] = second column
cout << "1st sub String : " << tokens[0] << '\n';
cout << "2nd sub String : " << tokens[1] << '\n';
Converting a string to a number:
int n;
stringstream num(tokens[1]);  // note: the second column holds decimals such as 0.005, so a double may fit better
num >> n;
These snippets might help you. References:
https://binaramedawatta.blogspot.com/2018/12/supporting-code-snippets-for-oop-take.html
https://www.geeksforgeeks.org/converting-strings-numbers-cc/
http://www.cplusplus.com/forum/beginner/87238/