The following is my code for calculating pi = 3.1415... approximately, using this formula:
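pi = 4 * sum_{k=0}^{inf} (-1)^k / (2k + 1) = 4 * ( 1 - 1/3 + 1/5 - 1/7 + ... )    (the Gregory-Leibniz series)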
use Time;
var timer = new Timer();
config const n = 10**9;
var x = 0.0, s = 0.0;
// timer.start(); // [1]_____
for k in 0 .. n {
s = ( if k % 2 == 0 then 1.0 else -1.0 ); // (-1)^k
x += s / ( 2.0 * k + 1.0 );
}
// timer.stop(); // [2]_____
// writeln( "time = ", timer.elapsed() ); // [3]_____
writef( "pi (approx) = %30.20dr\n", x * 4 );
// writef( "pi (exact) = %30.20dr\n", pi ); // [4]_____
When the above code is compiled with chpl --fast test.chpl and executed as time ./a.out, it runs in ~4 seconds:
pi (approx) = 3.14159265458805059268
real 0m4.334s
user 0m4.333s
sys 0m0.006s
On the other hand, if I uncomment Lines [1]-[3] (to use Timer), the program runs much slower, taking ~10 seconds:
time = 10.2284
pi (approx) = 3.14159265458805059268
real 0m10.238s
user 0m10.219s
sys 0m0.018s
The same slow-down occurs when I uncomment only Line [4] (to print the built-in value of pi, with Lines [1]-[3] kept commented out):
pi (approx) = 3.14159265458805059268
pi (exact) = 3.14159265358979311600
real 0m10.144s
user 0m10.141s
sys 0m0.009s
So I'm wondering why this slow-down occurs...
Am I missing something in the above code (e.g., wrong usage of Timer)?
My environment is OSX 10.11 + chapel-1.16 installed via Homebrew.
More details are below:
$ printchplenv --anonymize
CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: clang
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_MAKE: make
CHPL_ATOMICS: intrinsics
CHPL_GMP: gmp
CHPL_HWLOC: hwloc
CHPL_REGEXP: re2
CHPL_WIDE_POINTERS: struct
CHPL_AUX_FILESYS: none
$ clang --version
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Update
Following the suggestions, I installed Chapel from source by following this page and this page, adding CHPL_TARGET_COMPILER=gnu to ~/.chplconfig (before running make). Then all three cases above ran in ~4 seconds, so the problem may be related to clang on OSX 10.11. According to the comments, newer OSX (>= 10.12) does not have this problem, so it may be sufficient simply to upgrade to a newer OSX/clang (>= 9.0). FYI, the updated environment info (with GNU) is as follows:
$ printchplenv --anonymize
CHPL_TARGET_PLATFORM: darwin
CHPL_TARGET_COMPILER: gnu +
CHPL_TARGET_ARCH: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_MAKE: make
CHPL_ATOMICS: intrinsics
CHPL_GMP: none
CHPL_HWLOC: hwloc
CHPL_REGEXP: none
CHPL_WIDE_POINTERS: struct
CHPL_AUX_FILESYS: none
Am I missing something in the above code (e.g., wrong usage of Timer)?
No, you're not missing anything and are using Timer (and Chapel) in a completely reasonable way. From my own experimentation (which confirms yours and is noted in the comments under your question), this looks to be a back-end compiler issue rather than a fundamental problem in Chapel or your use of it.
The --fast flag reduces run-time checks, yet that does not seem to be the issue here.
Kindly also note how large the setup/operation add-on overheads are that get brought in just for educational purposes (to experiment with concurrent processing): the forall construct equipped with the atomic .add() method accrues far higher overheads than the concurrent processing can gain back, because there is so little computation inside the [PAR]-enabled fraction of the process (ref. the re-formulated Amdahl's Law on these too-thin [PAR] gains versus the indeed too-high add-on overheads to the [SEQ] costs).
An illustrative example:
use Time;
var timer = new Timer();
config const n = 10**9;
var s = 0.0, x = 0.0;
var AtomiX: atomic real; // [AtomiX]______
AtomiX.write( 0.0 ); // [AtomiX]______
timer.start(); // [1]_____
for k in 0 .. n {
s = ( if k % 2 == 0 then 1.0 else -1.0 ); // (-1)^k
x += s / ( 2.0 * k + 1.0 );
}
/* forall k in 0..n { AtomiX.add( ( if k % 2 == 0 then 1.0 else -1.0 )
/ ( 2.0 * k + 1.0 )
); } */ // [AtomiX]______
timer.stop(); // [2]_____
writeln( "time = ", timer.elapsed() ); // [3]_____
writef( "pi (approx) = %30.20dr\n", 4 * x );
// writef( "pi (approx) = %30.20dr\n", 4 * AtimiX.read() ); // [AtomiX]______
// writef( "pi (exact) = %30.20dr\n", pi ); // [4]_____
/*
--------------------------------------------------- [--fast] // AN EMPTY RUN
time = 1e-06
Real time: 9.582 s
User time: 8.479 s
Sys. time: 0.591 s
CPU share: 94.65 %
Exit code: 0
--------------------------------------------------- [--fast] // all commented
pi (approx) = 3.14159265458805059268
Real time: 15.553 s
User time: 13.484 s ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~> Timer ~ +/- 1 second ( O/S noise )
Sys. time: 0.985 s
CPU share: 93.03 %
Exit code: 0
--------------------------------------------------- [--fast] // Timer-un-commented
time = 5.30128
time = 5.3329
pi (approx) = 3.14159265458805059268
Real time: 14.356 s
User time: 13.047 s ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~< Timer ~ +/- 1 second ( O/S noise )
Sys. time: 0.585 s
CPU share: 94.95 %
Exit code: 0
Real time: 16.804 s
User time: 14.853 s
Sys. time: 0.925 s
CPU share: 93.89 %
Exit code: 0
-------------------------------------------------- [--fast] // Timer-un-commented + forall + Atomics
time = 14.7406
pi (approx) = 3.14159265458805680993
Real time: 28.099 s
User time: 26.246 s
Sys. time: 0.914 s
CPU share: 96.65 %
Exit code: 0
*/
Related
I am trying to learn CUDA and I am now stuck at running a simple nvprof command.
I am testing a simple script in both C++ and Fortran using CUDA. The CUDA kernels test two different ways of performing a simple task with the intent to show the importance of the branch divergence issue.
When I run
nvprof --metrics branch_efficiency ./codeCpp.x (i.e., on the C++ code), the command works, but when I try the same thing on the corresponding Fortran code it doesn't. What is curious is that a plain nvprof ./codeFortran.x on the Fortran executable will show output, but anything with the --metrics flag will not.
Below I paste some info (note: both codes compile fine and do not produce any runtime errors).
I am using Ubuntu 20
Can anyone help me understand this issue? Thank you.
===================== C++ code
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "device_launch_parameters.h"
#include "cuda_common.cuh"
// kernel without divergence
__global__ void code_without_divergence(){
// compute unique grid index
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// define some local variables
float a, b;
a = b = 0.0;
// compute the warp index
int warp_id = gid/32;
// conditional statement based on the warp id
if (warp_id % 2 == 0)
{
a = 150.0;
b = 50.0;
}
else
{
a = 200.0;
b = 75.0;
};
}
// kernel with divergence
__global__ void code_with_divergence(){
// compute unique grid index
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// define some local variables
float a, b;
a = b = 0.0;
// conditional statement based on the gid. This will force difference
// code branches within the same warp.
if (gid % 2 == 0)
{
a = 150.0;
b = 50.0;
}
else
{
a = 200.0;
b = 75.0;
};
}
int main (int argc, char** argv){
// set the block size
int size = 1 << 22;
dim3 block_size(128);
dim3 grid_size((size + block_size.x-1)/block_size.x);
code_without_divergence <<< grid_size, block_size>>>();
cudaDeviceSynchronize();
code_with_divergence <<<grid_size, block_size>>>();
cudaDeviceSynchronize();
cudaDeviceReset();
return EXIT_SUCCESS;
};
================ Fortran code
MODULE CUDAUtils
USE cudafor
IMPLICIT NONE
CONTAINS
! code without divergence routine
ATTRIBUTES(GLOBAL) SUBROUTINE code_without_divergence()
IMPLICIT NONE
!> local variables
INTEGER :: threadId, warpIdx
REAL(KIND=8) :: a,b
! get the unique threadID
threadId = (blockIdx%y-1) * gridDim%x * blockDim%x + &
(blockIdx%x-1) * blockDim%x + (threadIdx%x-1)
! adjust so that the threadId starts from 1
threadId = threadId + 1
! warp index
warpIdx = threadIdx%x/32
! perform the conditional statement
IF (MOD(warpIdx,2) == 0) THEN
a = 150.0D0
b = 50.0D0
ELSE
a = 200.0D0
b = 75.0D0
END IF
END SUBROUTINE code_without_divergence
! code with divergence routine
ATTRIBUTES(GLOBAL) SUBROUTINE code_with_divergence()
IMPLICIT NONE
!> local variables
INTEGER :: threadId, warpIdx
REAL(KIND=8) :: a,b
! get the unique threadID
threadId = (blockIdx%y-1) * gridDim%x * blockDim%x + &
(blockIdx%x-1) * blockDim%x + (threadIdx%x-1)
! adjust so that the threadId starts from 1
threadId = threadId + 1
! perform the conditional statement
IF (MOD(threadId,2) == 0) THEN
a = 150.0D0
b = 50.0D0
ELSE
a = 200.0D0
b = 75.0D0
END IF
END SUBROUTINE code_with_divergence
END MODULE CUDAUtils
PROGRAM main
USE CUDAUtils
IMPLICIT NONE
! define the variables
INTEGER :: size1 = 1e20
INTEGER :: istat
TYPE(DIM3) :: grid, tBlock
! blocksize is 42 along the 1st dimension only whereas grid is 2D
tBlock = DIM3(128,1,1)
grid = DIM3((size1 + tBlock%x)/tBlock%x,1,1)
! just call the module
CALL code_without_divergence<<<grid,tBlock>>>()
istat = cudaDeviceSynchronize()
! just call the module
CALL code_with_divergence<<<grid,tBlock>>>()
istat = cudaDeviceSynchronize()
STOP
END PROGRAM main
Output of nvprof --metrics branch_efficiency ./codeCpp.x
==6944== NVPROF is profiling process 6944, command: ./codeCpp.x
==6944== Profiling application: ./codeCpp.x
==6944== Profiling result:
==6944== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "NVIDIA GeForce MX330 (0)"
Kernel: code_without_divergence(void)
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
Kernel: code_with_divergence(void)
1 branch_efficiency Branch Efficiency 85.71% 85.71% 85.71%
Output of nvprof --metrics branch_efficiency ./codeFortran.x
==6983== NVPROF is profiling process 6983, command: ./codeFortran.x
==6983== Profiling application: ./codeFortran.x
==6983== Profiling result:
No events/metrics were profiled.
Output of nvprof ./codeFortran.x
==7002== NVPROF is profiling process 7002, command: ./codeFortran.x
==7002== Profiling application: ./codeFortran.x
==7002== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 99.82% 153.45ms 2 76.726ms 516ns 153.45ms cudaLaunchKernel
0.15% 231.24us 101 2.2890us 95ns 172.81us cuDeviceGetAttribute
0.01% 22.522us 1 22.522us 22.522us 22.522us cuDeviceGetName
0.01% 9.1310us 1 9.1310us 9.1310us 9.1310us cuDeviceGetPCIBusId
0.00% 5.4500us 2 2.7250us 876ns 4.5740us cudaDeviceSynchronize
0.00% 1.3480us 3 449ns 195ns 903ns cuDeviceGetCount
0.00% 611ns 1 611ns 611ns 611ns cuModuleGetLoadingMode
0.00% 520ns 2 260ns 117ns 403ns cuDeviceGet
0.00% 245ns 1 245ns 245ns 245ns cuDeviceTotalMem
0.00% 187ns 1 187ns 187ns 187ns cuDeviceGetUuid
Both the C++ and Fortran executables test the same CUDA concept. They both compile fine, and no runtime errors are shown on the terminal upon execution.
When I try the nvprof command on the C++ program, everything works as expected, but when I try it on the corresponding Fortran program there is no output (when using the --metrics flag). I would expect the same behavior as with the C++ code.
In some other discussions I found that for GPU versions above 7, nvprof is no longer supported and NVIDIA Nsight should be used; however, I do not think this is the case here because I get the expected output with the C++ program.
The reason the code was not profiling as expected was that the kernels were not actually running correctly in that case.
It's always good practice to make sure there are no runtime errors in the code before attempting any profiling. Proper CUDA error checking and compute-sanitizer are two methods to help with this.
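For reference, here is a minimal sketch of the usual CUDA launch-error-checking pattern (my own illustration, not code from the original post; the CHECK_CUDA macro name and dummy_kernel are hypothetical):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
// Minimal launch-error check; the macro name CHECK_CUDA is illustrative only.
#define CHECK_CUDA(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)
__global__ void dummy_kernel() { }
int main(void) {
    dummy_kernel<<<1, 128>>>();
    CHECK_CUDA(cudaGetLastError());      // reports invalid launch configurations
    CHECK_CUDA(cudaDeviceSynchronize()); // reports errors raised while the kernel ran
    return 0;
}
Running the failing executable under compute-sanitizer (e.g., compute-sanitizer ./codeFortran.x) should likewise flag the failed launches before any profiling is attempted.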
I am trying to create an MQL4 script (MQL4 is an almost C++-like language) in which I want to divide a double value into 9 parts, where the fractions are unequal yet increasing.
My current code attempts to do it this way (pseudo-code):
Lots1 = 0.1;
Lots2 = (Lots1 / 100) * 120;//120% of Lot1
Lots3 = (Lots2 / 100) * 130;//130% of Lot2
Lots4 = (Lots3 / 100) * 140;//140% of Lot3
Lots5 = (Lots4 / 100) * 140;//140% of Lot4
Lots6 = (Lots5 / 100) * 160;//160% of Lot5
Lots7 = (Lots6 / 100) * 170;//170% of Lot6
Lots8 = (Lots7 / 100) * 180;//180% of Lot7
Lots9 = (Lots8 / 100) * 190;//190% of Lot8
...
or better:
double Lots = 0.1; // a Lot Size
double lots = Lots;
...
/* Here is the array with percentages of lots' increments
in order */
int AllZoneLots[8] = { 120, 130, 140, 140, 160, 170, 180, 190 }; // 120%, 130%,...
/* Here, the lot sizes are used by looping the array
and increasing the lot size by the count */
for( int i = 0; i < ArraySize( AllZoneLots ); i++ ) {
lots = AllZoneLots[i] * ( lots / 100 );
// PlaceOrder( OP_BUY, lots );
}
But what I want is to just have a fixed value of 6.7 split into 9 parts, like these codes do, yet with the values increasing rather than all being the same...
e.g., 6.7 split into:
double lots[9] = { 0.10, 0.12, 0.16, 0.22, 0.31, 0.50, 0.85, 1.53, 2.91 };
/* This is just an example
of how to divide a value of 6.7 into 9 growing parts */
This can be done so as to make equal steps in the values. If there are 9 steps, divide the value by 45 to get the first value, which is also the equal step x. Why? Because the sum of 1..9 is 45.
x = 6.7 / 45
which is 0.148889
The first term is x, the second term is 2 * x, the third term is 3 * x etc. They add up to 45 * x which is 6.7, but it's better to divide last. So the second term, say, would be 6.7 * 2 / 45;
Here is code which shows how it can be done in C, since MQL4 works with C Syntax:
#include <stdio.h>
int main(void) {
double val = 6.7;
double term;
double sum = 0;
for(int i = 1; i <= 9; i++) {
term = val * i / 45;
printf("%.3f ", term);
sum += term;
}
printf("\nsum = %.3f\n", sum);
}
Program output:
0.149 0.298 0.447 0.596 0.744 0.893 1.042 1.191 1.340
sum = 6.700
Not sure I understood right, but you probably need a total of 3.5 shared between all the lots.
And I can see only 8 lots, not counting the initial one.
totalPercentage = 0;
for(int i = 0; i < ArraySize(AllZoneLots); i++) {
totalPercentage += AllZoneLots[i];
}
double totalValue = 3.5;
// total value is total percentage, Lots1 - 100%, so:
Lots1 = totalValue / totalPercentage * 100.00;
Then you continue with your code.
If you want to include Lots1, you just add 100 to the total and do the same.
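A quick worked example with the percentages from the question (the resulting numbers are my own illustration): AllZoneLots sums to 120 + 130 + 140 + 140 + 160 + 170 + 180 + 190 = 1230, so Lots1 = 3.5 / 1230 * 100 ≈ 0.285. Including Lots1's own 100% gives a total of 1330 and Lots1 = 3.5 / 1330 * 100 ≈ 0.263.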
Q : How to divide a number into several, unequal, yet increasing numbers [ for sending a PlaceOrder( OP_BUY, lots ) contract XTO ]?
A : The problem is not as free as it might look at first sight:
In the MetaTrader Terminal ecosystem, the problem formulation also has to obey externally decided factors (which are mandatory for any XTO that has an ambition not to get rejected as principally incompatible with the set of XTO Terms & Conditions, and to get filled ~ "placed" at market).
These factors are reportable via a call to:
MarketInfo( <_a_SymbolToReportSTRING>, MODE_MINLOT ); // a minimum permitted size
MarketInfo( <_a_SymbolToReportSTRING>, MODE_LOTSTEP ); // a mandatory size-stepping
MarketInfo( <_a_SymbolToReportSTRING>, MODE_MAXLOT ); // a maximum permitted size
Additionally, any such lot size has to be "normalised", prior to submitting an XTO, to a given number of decimal places, so as to be successfully placed / accepted by the Trading-Server on the Broker's side. A failure to do so results in remotely rejected XTO-s (which obviously come at a remarkable blocking / immense code-execution latency penalty that one would always want to prevent from ever happening in real trading).
Last, but not least, any such XTO sizing has to be covered by a safe amount of leveraged equity (checking the free-margin availability first, before ever sending any such XTO, for the reasons just mentioned above).
The code:
While the initial pseudo-code above does a progressive (Martingale-like) lot-size scaling:
>>> aListOfFACTORs = [ 100, 120, 130, 140, 140, 160, 170, 180, 190 ]
>>> for endPoint in range( len( aListOfFACTORs ) ):
... product = 1.
... for item in aListOfFACTORs[:1+endPoint]:
... product *= item / 100.
... print( "Lots{0:} ~ ought be about {1:} times the amount of Lots1".format( 1 + endPoint, product ) )
...
Lots1 ~ ought be about 1.0 times the amount of Lots1
Lots2 ~ ought be about 1.2 times the amount of Lots1
Lots3 ~ ought be about 1.56 times the amount of Lots1
Lots4 ~ ought be about 2.184 times the amount of Lots1
Lots5 ~ ought be about 3.0576 times the amount of Lots1
Lots6 ~ ought be about 4.89216 times the amount of Lots1
Lots7 ~ ought be about 8.316672 times the amount of Lots1
Lots8 ~ ought be about 14.9700096 times the amount of Lots1
Lots9 ~ ought be about 28.44301824 times the amount of Lots1
the _MINLOT, _LOTSTEP and _MAXLOT put the game into a new light.
Any successful strategy is not free to choose the sizes. Given the said 9 steps and a fixed total amount of ~ 6.7 lots, the process can obey the stepping and the total, plus it must obey the MarketInfo()-reported sizing algebra.
Given that 9 steps are mandatory,
each one has to be at least _MINLOT-sized:
double total_amount_to_split = aSizeToSPLIT;
total_amount_to_split = Min( aSizeToSPLIT, // a wished-to-have-sizing
FreeMargin/LotInBaseCurr*sFty // a FreeMargin-covered size
);
int next = 0;
while ( total_amount_to_split >= _MINLOT )
{ total_amount_to_split -= _MINLOT;
lot_size[next++] = _MINLOT;
}
/*
###################################################################################
------------------------------------------------- HERE, WE HAVE 0:next lot_sizes
next NEED NOT == 9
If there is anything yet to split:
there is an integer amount of _LOTSTEP-s to distribute among 'em
HERE, and ONLY here, you have a freedom to decide about split/mapping
of the integer amount of _LOTSTEP-sized
additions to the _MINLOT "pre"-sets
in lot_size[]-s
YET, still no more than _MAXLOT is permissible for the above explained reasons
------------------------------------------------- CODE has to obey this, if XTO-s
are to
get a chance
###################################################################################
*/
I am creating a random distribution of points in Fortran, and this is done with a do while loop. I want to speed up this process via OpenMP, but I read that you can't simply use !$OMP PARALLEL DO for do while loops. I tried converting my original do while into a do loop nested inside the do while. However, I can't see any speedup in the code; by this I mean it takes the same time as the serial version. I can't seem to figure out what the issue is and I've been stuck, so I would appreciate any advice. I've shown the code below.
The original loop:
!OMP PARALLEL DO
do while (count < size(zeta_list,2))
call random_number(x)
call random_number(y)
x1 = a + FLOOR((b+1-a)*x)
y1 = a + FLOOR((b+1-a)*y)
if (abs(y1) <= abs(1/x1)) then
count = count + 1
call random_number(theta)
zeta_list(1,count) = x1*sin(2*pi_16*theta)
zeta_list(2,count) = x1*cos(2*pi_16*theta)
end if
end do
!OMP END PARALLEL DO
and after I tried to convert it,
!$OMP PARALLEL
do while (count < size(zeta_list,2))
!$OMP DO
do i=1,size(zeta_list,2),1
call random_number(x)
call random_number(y)
x1 = a + FLOOR((b+1-a)*x)
y1 = a + FLOOR((b+1-a)*y)
if (abs(y1) <= abs(1/x1)) then
call random_number(theta)
count = count + 1
zeta_list(1,i) = x1*sin(2*pi_16*theta)
zeta_list(2,i) = x1*cos(2*pi_16*theta)
end if
end do
!$OMP END DO
end do
!$OMP END PARALLEL
The entire code is
PROGRAM RANDOM_DISTRIBUTION
IMPLICIT NONE
DOUBLE PRECISION, DIMENSION(2,1000000)::zeta_list
DOUBLE PRECISION::x,y,x1,y1,theta
REAL::a,b,n
INTEGER::count,t1,t2,clock_rate,clock_max,i
DOUBLE PRECISION,PARAMETER::pi_16=4*atan(1.0_16)
call system_clock ( t1, clock_rate, clock_max )
n = 1000
b = n/2
a = -n/2
count = 0
zeta_list = 0
x = 0
y = 0
x1 = 0
y1 = 0
theta = 0
call random_seed()
!$OMP PARALLEL
do while (count < size(zeta_list,2))
!$OMP DO
do i=1,size(zeta_list,2),1
call random_number(x)
call random_number(y)
x1 = a + FLOOR((b+1-a)*x)
y1 = a + FLOOR((b+1-a)*y)
if (abs(y1) <= abs(1/x1)) then
call random_number(theta)
count = count + 1
zeta_list(1,i) = x1*sin(2*pi_16*theta)
zeta_list(2,i) = x1*cos(2*pi_16*theta)
end if
end do
!$OMP END DO
end do
!$OMP END PARALLEL
call system_clock ( t2, clock_rate, clock_max )
write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate) ,'seconds'
stop
END PROGRAM RANDOM_DISTRIBUTION
compiled with gfortran test.f90 -fopenmp
Instead of performing a hard-to-distribute while loop, I propose the following: use a loop over the array index.
I suppose that you want to generate random samples in the array zeta_list. I moved the while loop inside the parallel loop.
Still, beware that you need an "OpenMP-aware" PRNG. This is the case in recent gfortran versions; I don't know about other compilers.
I also changed the 1.0_16 into a 1.0d0, as fixed numeric constants are not a good way to specify the kind parameter in general, and reduced the size of the static array.
PROGRAM RANDOM_DISTRIBUTION
IMPLICIT NONE
DOUBLE PRECISION, DIMENSION(2,100000)::zeta_list
DOUBLE PRECISION::x,y,x1,y1,theta
REAL::a,b,n
INTEGER::count,t1,t2,clock_rate,clock_max,i
DOUBLE PRECISION,PARAMETER::pi_16=4*atan(1.0d0)
call system_clock ( t1, clock_rate, clock_max )
n = 1000
b = n/2
a = -n/2
count = 0
zeta_list = 0
x = 0
y = 0
x1 = 0
y1 = 0
theta = 0
call random_seed()
!$OMP PARALLEL DO private(i, x, y, x1, y1, theta)
do i = 1, size(zeta_list, 2)
inner_loop: do
call random_number(x)
call random_number(y)
x1 = a + FLOOR((b+1-a)*x)
y1 = a + FLOOR((b+1-a)*y)
if (abs(y1) <= abs(1/x1)) then
call random_number(theta)
zeta_list(1,i) = x1*sin(2*pi_16*theta)
zeta_list(2,i) = x1*cos(2*pi_16*theta)
exit inner_loop
end if
end do inner_loop
end do
!$OMP END PARALLEL DO
write(*,*) zeta_list(:,1)
write(*,*) zeta_list(:,2)
call system_clock ( t2, clock_rate, clock_max )
write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate) ,'seconds'
END PROGRAM RANDOM_DISTRIBUTION
The use of random_number in OpenMP threads is safe for gfortran 5 but you need gfortran 7 to get a threaded random number generator. I list the timing with two cores:
user#pc$ gfortran-5 -O3 -Wall -fopenmp -o prd prd.f90
user#pc$ OMP_NUM_THREADS=1 ./prd
47.496326386583306 237.29327630545950
-101.11803913888293 147.70288474064185
Elapsed real time = 3.47700000 seconds
user#pc$ OMP_NUM_THREADS=2 ./prd
0.0000000000000000 -0.0000000000000000
-160.53394672041205 49.526275353269853
Elapsed real time = 12.1479998 seconds
user#pc$ rm fort.1*
user#pc$ gfortran-5 -O3 -Wall -fopenmp -o prd prd.f90
user#pc$ OMP_NUM_THREADS=1 ./prd
Elapsed real time = 3.05100012 seconds
user#pc$ OMP_NUM_THREADS=2 ./prd
Elapsed real time = 9.09599972 seconds
user#pc$ gfortran-6 -O3 -Wall -fopenmp -o prd prd.f90
user#pc$ OMP_NUM_THREADS=1 ./prd
Elapsed real time = 3.09200001 seconds
user#pc$ OMP_NUM_THREADS=2 ./prd
Elapsed real time = 12.3350000 seconds
user#pc$ gfortran-7 -O3 -Wall -fopenmp -o prd prd.f90
user#pc$ OMP_NUM_THREADS=1 ./prd
Elapsed real time = 1.83200002 seconds
user#pc$ OMP_NUM_THREADS=2 ./prd
Elapsed real time = 0.986999989 seconds
The result is quite obvious: prior to gfortran 7 OpenMP-ing the code here slows it down significantly.
I am looking at the example given in the manual of nloptr.
I replaced the last part of the code by
local_opts <- list( "algorithm" = "NLOPT_LD_MMA",
"xtol_rel" = 0.0,
"ftol_rel" = 0.0,
"ftol_abs" = 0.0,
"xtol_abs" = 0.0)
opts <- list( "algorithm" = "NLOPT_LD_AUGLAG",
"xtol_rel" = 0.0,
"ftol_rel" = 0.0,
"ftol_abs" = 0.0,
"xtol_abs" = 0.0,
"maxeval" = 100000,
"local_opts" = local_opts )
res <- nloptr( x0=x0,
eval_f=eval_f,
lb=lb,
ub=ub,
eval_g_ineq=eval_g_ineq,
eval_g_eq=eval_g_eq,
opts=opts)
print( res )
That is, I changed xtol/ftol rel/abs to all be 0, for both the main solver and the local solver. Note that both of them use gradient-based algorithms. I also increased the maximum number of steps from 1k to 100k.
However, the solver terminates much earlier, at about 3k steps.
Call: nloptr(x0 = x0, eval_f = eval_f, lb = lb, ub = ub, eval_g_ineq = eval_g_ineq, eval_g_eq = eval_g_eq, opts = opts)
Minimization using NLopt version 2.4.2
NLopt solver status: 3 ( NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached. )
Number of Iterations....: 3132
Termination conditions: xtol_rel: 0 ftol_rel: 0 ftol_abs: 0 xtol_abs: 0 maxeval: 1e+05
Number of inequality constraints: 1
Number of equality constraints: 1
Optimal value of objective function: 17.0140172891563
Optimal value of controls: 1 4.743 3.82115 1.379408
Looking at the C++ implementation of NLopt, it seems that this should not happen: the termination conditions given by the various tolerance levels are expressed as strict inequalities. So am I missing something about the meaning of the solver status "NLOPT_FTOL_REACHED: Optimization stopped because ftol_rel or ftol_abs (above) was reached."?
Thanks!
John
I am new to Matlab and would appreciate any assistance possible!
I am running a simulation and so the results vary with each run of the simulation. I want to collect the results for analysis.
For example, during the first simulation run, the level of a plasma coagulation factor may vary over 5 hours as such:
R(1) = [1.0 0.98 0.86 0.96 0.89]
In the second run, the levels at each time period may be slightly different, eg.
R(2) = [1.0 0.95 0.96 0.89 0.86]
I would like (perhaps by using the parfor function) to create a matrix, e.g.
R = [1.0 0.98 0.86 0.96 0.89
1.0 0.95 0.96 0.89 0.86]
I have encountered problems ranging from "In an assignment A(I) = B, the number of elements in B and I must be the same" to getting a matrix of zeros or ones (depending on what I use for the preallocation).
I will need the simulation to run about 10000 times in order to collect a meaningful amount of results.
Can anyone suggest how this might be achieved? Detailed guidance or (semi-)complete code would be much appreciated by someone new to Matlab like me.
Thanks in advance!
This is my actual code; as you can see, there are 4 variables that vary over 744 hours (31 days), each of which I would like to collect:
Iterations = 10000;
PGINR = zeros(Iterations, 744);
PGAmount = zeros(Iterations, 744);
CAINR = zeros(Iterations, 744);
CAAmount = zeros(Iterations, 744);
for iii = 1:Iterations
[{PGINR(iii)}, {PGAmount(iii)}, {CAINR(iii)}, {CAAmount(iii)}] = ChineseTTRSimulationB();
end
filename = 'ChineseTTRSimulationResults.xlsx';
xlswrite(filename, PGINR, 2)
xlswrite(filename, PGAmount, 3)
xlswrite(filename, CAINR, 5)
xlswrite(filename, CAAmount, 6)
Are you looking for something like this?
I simplified your code a little bit for better understanding and added some dummy data and a dummy function.
main.m
Iterations = 10;
PGINR = zeros(Iterations, 2);
PGAmount = zeros(Iterations, 2);
%fake data
x = rand(Iterations,1);
y = rand(Iterations,1);
parfor iii = 1:Iterations
[PGINR(iii,:), PGAmount(iii,:)] = ChineseTTRSimulationB(x(iii), y(iii));
end
ChineseTTRSimulationB.m
function [PGINRi, PGAmounti] = ChineseTTRSimulationB(x,y)
PGINRi = [x + y, x];
PGAmounti = [x*y, y];
end
Save each parfor result in a cell and combine them later:
Iterations = 10000;
PGINR = cell(1, Iterations);
PGAmount = cell(1, Iterations);
CAINR = cell(1, Iterations);
CAAmount = cell(1, Iterations);
parfor i = 1:Iterations
[PGINR{i}, PGAmount{i}, CAINR{i}, CAAmount{i}] = ChineseTTRSimulationB();
end
PGINR = cell2mat(PGINR); % 1x7440000 vector
% and so on...