Using Parfor to create matrix from a vector - c++

I am new to Matlab and would appreciate any assistance possible!
I am running a simulation and so the results vary with each run of the simulation. I want to collect the results for analysis.
For example, during the first simulation run, the level of a plasma coagulation factor may vary over 5 hours as such:
R(1) = [1.0 0.98 0.86 0.96 0.89]
In the second run, the levels at each time period may be slightly different, eg.
R(2) = [1.0 0.95 0.96 0.89 0.86]
I would like (perhaps by using the parfor function) to create a matrix, e.g.
R = [1.0 0.98 0.86 0.96 0.89
1.0 0.95 0.96 0.89 0.86]
I have encountered problems ranging from "In an assignment A(I) = B, the number of elements in B and I must be the same" to getting a matrix of zeros or ones (depending on what I use for the preallocation).
I will need the simulation to run about 10000 times in order to collect a meaningful amount of results.
Can anyone suggest how this might be achieved? Detailed guidance or (semi-)complete code would be much appreciated by someone new to Matlab like me.
Thanks in advance!
This is my actual code, and as you can see, there are 4 variables that vary over 744 hours (31 days) which I would like to individually collect:
Iterations = 10000;
PGINR = zeros(Iterations, 744);
PGAmount = zeros(Iterations, 744);
CAINR = zeros(Iterations, 744);
CAAmount = zeros(Iterations, 744);
for iii = 1:Iterations
    [{PGINR(iii)}, {PGAmount(iii)}, {CAINR(iii)}, {CAAmount(iii)}] = ChineseTTRSimulationB();
end
filename = 'ChineseTTRSimulationResults.xlsx';
xlswrite(filename, PGINR, 2)
xlswrite(filename, PGAmount, 3)
xlswrite(filename, CAINR, 5)
xlswrite(filename, CAAmount, 6)

Are you looking for something like this?
I simplified your code a little for better understanding and added some dummy data and a dummy function.
main.m
Iterations = 10;
PGINR = zeros(Iterations, 2);
PGAmount = zeros(Iterations, 2);
%fake data
x = rand(Iterations,1);
y = rand(Iterations,1);
parfor iii = 1:Iterations
    [PGINR(iii,:), PGAmount(iii,:)] = ChineseTTRSimulationB(x(iii), y(iii));
end
ChineseTTRSimulationB.m
function [PGINRi, PGAmounti] = ChineseTTRSimulationB(x,y)
    PGINRi = [x + y, x];
    PGAmounti = [x*y, y];
end

Save each parfor result in a cell array and combine them later.
Iterations = 10000;
PGINR = cell(1, Iterations);
PGAmount = cell(1, Iterations);
CAINR = cell(1, Iterations);
CAAmount = cell(1, Iterations);
parfor i = 1:Iterations
    [PGINR{i}, PGAmount{i}, CAINR{i}, CAAmount{i}] = ChineseTTRSimulationB();
end
PGINR = cell2mat(PGINR); % 1x7440000 vector
%and so on...

Related

How to divide a number into several, unequal, yet increasing numbers [ for sending a PlaceOrder( OP_BUY, lots ) contract XTO ]

I am trying to create an MQL4 script (MQL4 is an almost C++-related language) in which I want to divide a double value into 9 parts, where the fractions are unequal yet increasing.
My current code attempts to do it this way (pseudo-code) :
Lots1 = 0.1;
Lots2 = (Lots1 / 100) * 120;//120% of Lot1
Lots3 = (Lots2 / 100) * 130;//130% of Lot2
Lots4 = (Lots3 / 100) * 140;//140% of Lot3
Lots5 = (Lots4 / 100) * 140;//140% of Lot4
Lots6 = (Lots5 / 100) * 160;//160% of Lot5
Lots7 = (Lots6 / 100) * 170;//170% of Lot6
Lots8 = (Lots7 / 100) * 180;//180% of Lot7
Lots9 = (Lots8 / 100) * 190;//190% of Lot8
...
or better :
double Lots = 0.1; // a Lot Size
double lot = Lots;
...
/* Here is the array with percentages of lots' increments
in order */
int AllZoneLots[8] = { 120, 130, 140, 140, 160, 170, 180, 190 }; // 120%, 130%,...
/* Here, the lot sizes are used by looping the array
and increasing the lot size by the count */
for( int i = 0; i < ArraySize( AllZoneLots ); i++ ) {
    lots = AllZoneLots[i] * ( lots / 100 );
    // PlaceOrder( OP_BUY, lots );
}
But what I want is to just have a fixed value of 6.7 split into 9 parts, like these codes do, yet with the values increasing rather than all being the same...
e.g, 6.7 split into :
double lots[] = { 0.10, 0.12, 0.16, 0.22, 0.31, 0.50, 0.85, 1.53, 2.91 };
/* This is just an example
   of how to divide a value of 6.7 into 9 growing parts */
This can be done so as to make equal steps in the values. If there are 9 steps, divide the value by 45 to get both the first value and the equal step x. Why? Because the sum of 1..9 is 45.
x = 6.7 / 45
which is 0.148889.
The first term is x, the second term is 2 * x, the third term is 3 * x, etc. They add up to 45 * x, which is 6.7, but it's better to divide last. So the second term, say, would be 6.7 * 2 / 45.
Here is code which shows how it can be done in C, since MQL4 uses C syntax:
#include <stdio.h>

int main(void) {
    double val = 6.7;
    double term;
    double sum = 0;
    for (int i = 1; i <= 9; i++) {
        term = val * i / 45;
        printf("%.3f ", term);
        sum += term;
    }
    printf("\nsum = %.3f\n", sum);
}
Program output:
0.149 0.298 0.447 0.596 0.744 0.893 1.042 1.191 1.340
sum = 6.700
Not sure I understood right, but you probably need a total of 3.5 shared between all lots.
And I can see only 8 lots, not counting the initial one.
double totalPercentage = 0;
for (int i = 0; i < ArraySize(AllZoneLots); i++) {
    totalPercentage += AllZoneLots[i];
}
double totalValue = 3.5;
// total value is total percentage, Lots1 - 100%, so:
Lots1 = totalValue / totalPercentage * 100.00;
Then you continue with your code.
If you want to include Lots1, you just add 100 to the total and do the same.
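As a hedged illustration of that proportional split in plain C++ (using the question's factors and, as the summation above implies, treating each later lot as a fixed percentage of Lots1; the total and the names are only examples):
#include <cstdio>

int main() {
    // Percentages of Lots1 for lots 2..9, taken from the question's AllZoneLots[]
    const int AllZoneLots[8] = { 120, 130, 140, 140, 160, 170, 180, 190 };
    const double totalValue  = 3.5;   // the total to be shared between the lots

    // Sum the percentages; Lots1 itself counts as 100 %
    double totalPercentage = 100.0;
    for (int i = 0; i < 8; ++i)
        totalPercentage += AllZoneLots[i];

    // Lots1 corresponds to 100 % of the shared total
    double lots[9];
    lots[0] = totalValue / totalPercentage * 100.0;

    // Remaining lots are fixed percentages of Lots1
    double sum = lots[0];
    for (int i = 1; i < 9; ++i) {
        lots[i] = lots[0] * AllZoneLots[i - 1] / 100.0;
        sum += lots[i];
    }

    for (int i = 0; i < 9; ++i)
        printf("Lots%d = %.5f\n", i + 1, lots[i]);
    printf("sum   = %.5f\n", sum);   // should print the requested total
    return 0;
}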
Q : How to divide a number into several unequal, yet increasing numbers [ for sending a PlaceOrder( OP_BUY, lots ) contract XTO ]?
A : The problem is not as free as it might look at first sight:
In the MetaTrader Terminal ecosystem, the problem formulation also has to obey externally decided factors (which are mandatory for any XTO that has an ambition not to get rejected as being principally incompatible with the XTO Terms & Conditions set, and to get filled ~ "placed" At Market).
These factors are reportable via a call to:
MarketInfo( <_a_SymbolToReportSTRING>, MODE_MINLOT ); // a minimum permitted size
MarketInfo( <_a_SymbolToReportSTRING>, MODE_LOTSTEP ); // a mandatory size-stepping
MarketInfo( <_a_SymbolToReportSTRING>, MODE_MAXLOT ); // a maximum permitted size
Additionally, any such lot size has to be "normalised" to a given number of decimal places prior to submitting an XTO, so as to be successfully placed / accepted by the Trading-Server on the Broker's side. A failure to do so results in remotely rejected XTO-s (which obviously come at a remarkable blocking / immense code-execution latency penalty one would always want to prevent from ever happening in real trading).
Last, but not least, any such XTO sizing has to be covered by a safe amount of leveraged equity (checking the free-margin availability first, before ever sending any such XTO, for the reasons just mentioned above).
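As an illustration of that normalisation / bracketing idea only (plain C++, not MQL4; the bracket values are made up rather than read from MarketInfo()):
#include <algorithm>
#include <cmath>
#include <cstdio>

// Hypothetical bracket values; in MQL4 they would come from
// MarketInfo( ..., MODE_MINLOT / MODE_LOTSTEP / MODE_MAXLOT )
double normalise_size(double wished, double minLot, double lotStep, double maxLot) {
    // snap down to an integer number of lot-steps above the minimum
    double snapped = minLot + std::floor((wished - minLot) / lotStep) * lotStep;
    // clamp into the permitted [minLot, maxLot] bracket
    return std::max(minLot, std::min(snapped, maxLot));
}

int main() {
    // e.g. minLot = 0.01, lotStep = 0.01, maxLot = 100.0 (illustrative values only)
    printf("%.2f\n", normalise_size(0.148889, 0.01, 0.01, 100.0)); // prints 0.14
    return 0;
}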
The code:
While the initial pseudo-code above does a progressive (Martingale-like) lot-size scaling:
>>> aListOfFACTORs = [ 100, 120, 130, 140, 140, 160, 170, 180, 190 ]
>>> for endPoint in range( len( aListOfFACTORs ) ):
... product = 1.
... for item in aListOfFACTORs[:1+endPoint]:
... product *= item / 100.
... print( "Lots{0:} ~ ought be about {1:} times the amount of Lots1".format( 1 + endPoint, product ) )
...
Lots1 ~ ought be about 1.0 times the amount of Lots1
Lots2 ~ ought be about 1.2 times the amount of Lots1
Lots3 ~ ought be about 1.56 times the amount of Lots1
Lots4 ~ ought be about 2.184 times the amount of Lots1
Lots5 ~ ought be about 3.0576 times the amount of Lots1
Lots6 ~ ought be about 4.89216 times the amount of Lots1
Lots7 ~ ought be about 8.316672 times the amount of Lots1
Lots8 ~ ought be about 14.9700096 times the amount of Lots1
Lots9 ~ ought be about 28.44301824 times the amount of Lots1
The _MINLOT, _LOTSTEP and _MAXLOT put the game into a new light.
Any successful strategy is not free to choose the sizes. Given the said 9 steps and a fixed total amount of ~ 6.7 lots, the process can obey the stepping and the total, plus it must obey the MarketInfo()-reported sizing algebra.
Given 9 steps are mandatory,
each one has to be at least _MINLOT-sized:
double total_amount_to_split = aSizeToSPLIT;
total_amount_to_split = Min( aSizeToSPLIT, // a wished-to-have-sizing
FreeMargin/LotInBaseCurr*sFty // a FreeMargin-covered size
);
int next = 0;
while ( total_amount_to_split >= _MINLOT )
{ total_amount_to_split -= _MINLOT;
lot_size[next++] = _MINLOT;
}
/*
###################################################################################
------------------------------------------------- HERE, WE HAVE 0:next lot_sizes
next NEED NOT == 9
If there is anything yet to split:
there is an integer amount of _LOTSTEP-s to distribute among 'em
HERE, and ONLY here, you have a freedom to decide about split/mapping
of the integer amount of _LOTSTEP-sized
additions to the _MINLOT "pre"-sets
in lot_size[]-s
YET, still no more than _MAXLOT is permissible for the above explained reasons
------------------------------------------------- CODE has to obey this, if XTO-s
are to
get a chance
###################################################################################
*/
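To make the comment block above concrete, here is a hedged plain-C++ sketch (not MQL4; _MINLOT / _LOTSTEP / _MAXLOT and the 9 slots are stand-in values) of one possible mapping: hand out the remaining whole _LOTSTEP-s with increasing weights 1..9, capped at _MAXLOT:
#include <cstdio>
#include <vector>

int main() {
    // Stand-in values; in a real script these would come from MarketInfo()
    const double _MINLOT  = 0.10;
    const double _LOTSTEP = 0.10;
    const double _MAXLOT  = 5.00;
    const int    nSlots   = 9;
    const double total    = 6.7;                 // the fixed amount to split

    std::vector<double> lot(nSlots, _MINLOT);    // mandatory _MINLOT pre-sets
    long steps = (long)((total - nSlots * _MINLOT) / _LOTSTEP + 0.5);

    // One possible increasing mapping: weight slot i by (i+1), i.e. the
    // 1..9 / 45 scheme from the other answer, rounded down to whole steps.
    long given = 0;
    const long weightSum = (long)nSlots * (nSlots + 1) / 2;    // = 45
    for (int i = 0; i < nSlots; ++i) {
        long s = steps * (i + 1) / weightSum;                  // whole steps only
        lot[i] += s * _LOTSTEP;
        given  += s;
    }
    // Left-over steps from the rounding go to the largest slots, capped at _MAXLOT.
    for (int i = nSlots - 1; given < steps && i >= 0; --i)
        if (lot[i] + _LOTSTEP <= _MAXLOT) { lot[i] += _LOTSTEP; ++given; }

    for (int i = 0; i < nSlots; ++i) printf("lot_size[%d] = %.2f\n", i, lot[i]);
    return 0;
}
For the 6.7 / 9-slot example this yields the strictly increasing split 0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 1.1, 1.2, 1.3, which sums back to 6.7 while respecting the _MINLOT and _LOTSTEP constraints.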

Plot variables out of differential equations system function

I have a system of 4 differential equations in 4 unknowns in a function (subsystem4) and I solved it with the odeint function. I managed to plot the results of the system. My problem is that I also want to plot some other equations (e.g. x, y, vcxdot...) which are included in the same function (subsystem4), but I get NameError: name 'vcxdot' is not defined. Also, I want to use some of these equations (not only the results of the equation system) as inputs in a following differential equations system and plot all the equations over the same period of time (t). I have done this using Matlab-Simulink, but it was much easier because of the Simulink blocks. How can I access and plot all the equations of a function (subsystem4) and use them as input in a following system? I am new to Python and I use Python 2.7.12. Thank you in advance!
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
def subsystem4(u,t):
    added_mass_x = 0.03 # kg
    added_mass_y = 0.04
    mb = 0.3 # kg
    m1 = mb-added_mass_x
    m2 = mb-added_mass_y
    l1 = 0.07 # m
    l2 = 0.05 # m
    J = 0.00050797 # kgm^2
    Sa = 0.0110 # m^2
    Cd = 2.44
    Cl = 3.41
    Kd = 0.000655 # kgm^2
    r = 1000 # kg/m^3
    f = 2 # Hz
    c1 = 0.5*r*Sa*Cd
    c2 = 0.5*r*Sa*Cl
    c3 = 0.5*mb*(l1**2)
    c4 = Kd/J
    c5 = (1/(2*J))*(l1**2)*mb*l2
    c6 = (1/(3*J))*(l1**3)*mb
    vcx = u[0]
    vcy = u[1]
    psi = u[2]
    wz = u[3]
    x = 3 + 0.3*np.cos(t)
    y = 0.5 + 0.3*np.sin(t)
    xdot = -0.3*np.sin(t)
    ydot = 0.3*np.cos(t)
    xdotdot = -0.3*np.cos(t)
    ydotdot = -0.3*np.sin(t)
    vcx = xdot*np.cos(psi)-ydot*np.sin(psi)
    vcy = ydot*np.cos(psi)+xdot*np.sin(psi)
    psidot = wz
    vcxdot = xdotdot*np.cos(psi)-xdot*np.sin(psi)*psidot-ydotdot*np.sin(psi)-ydot*np.cos(psi)*psidot
    vcydot = ydotdot*np.cos(psi)-ydot*np.sin(psi)*psidot+xdotdot*np.sin(psi)+xdot*np.cos(psi)*psidot
    g1 = -(m1/c3)*vcxdot+(m2/c3)*vcy*wz-(c1/c3)*vcx*np.sqrt((vcx**2)+(vcy**2))+(c2/c3)*vcy*np.sqrt((vcx**2)+(vcy**2))*np.arctan2(vcy,vcx)
    g2 = (m2/c3)*vcydot+(m1/c3)*vcx*wz+(c1/c3)*vcy*np.sqrt((vcx**2)+(vcy**2))+(c2/c3)*vcx*np.sqrt((vcx**2)+(vcy**2))*np.arctan2(vcy,vcx)
    A = 12*np.sin(2*np.pi*f*t+np.pi)
    if A>=0.1:
        wzdot = ((m1-m2)/J)*vcx*vcy-c4*wz**2*np.sign(wz)-c5*g2-c6*np.sqrt((g1**2)+(g2**2))
    elif A<-0.1:
        wzdot = ((m1-m2)/J)*vcx*vcy-c4*wz**2*np.sign(wz)-c5*g2+c6*np.sqrt((g1**2)+(g2**2))
    else:
        wzdot = ((m1-m2)/J)*vcx*vcy-c4*wz**2*np.sign(wz)-c5*g2
    return [vcxdot,vcydot,psidot,wzdot]
u0 = [0,0,0,0]
t = np.linspace(0,15,1000)
u = odeint(subsystem4,u0,t)
vcx = u[:,0]
vcy = u[:,1]
psi = u[:,2]
wz = u[:,3]
plt.figure(1)
plt.subplot(211)
plt.plot(t,vcx,'r-',linewidth=2,label='vcx')
plt.plot(t,vcy,'b--',linewidth=2,label='vcy')
plt.plot(t,psi,'g:',linewidth=2,label='psi')
plt.plot(t,wz,'c',linewidth=2,label='wz')
plt.xlabel('time')
plt.legend()
plt.show()
To the immediate question of plotting the derivatives, you can get the velocities by directly calling the ODE function again on the solution,
u = odeint(subsystem4,u0,t)
udot = subsystem4(u.T,t)
and get the separate velocity arrays via
vcxdot,vcydot,psidot,wzdot = udot
In this case the function involves branching, which is not very friendly to vectorized calls of it. There are ways to vectorize branching, but the easiest work-around is to loop manually through the solution points, which is slower than a working vectorized implementation. This will again produce a list of tuples like odeint, so the result has to be transposed into a tuple of lists for "easy" assignment to the single array variables.
udot = [ subsystem4(uk, tk) for uk, tk in zip(u,t) ];
vcxdot,vcydot,psidot,wzdot = np.asarray(udot).T
This may appear to somewhat double the computation, but not really, as the solution points are usually interpolated from the internal step points of the solver. The evaluation of the ODE function during integration usually happens at points that are different from the solution points.
For the other variables, extract the computation of position and velocities into functions to have the constant and composition in one place only:
def xy_pos(t): return 3 + 0.3*np.cos(t), 0.5 + 0.3*np.sin(t)
def xy_vel(t): return -0.3*np.sin(t), 0.3*np.cos(t)
def xy_acc(t): return -0.3*np.cos(t), -0.3*np.sin(t)
or similar that you can then use both inside the ODE function and in preparing the plots.
What Simulink most likely does is to collect all the equations of all the blocks and form this into one big ODE system which is then solved for the whole state at once. You will need to implement something similar. One big state vector, and each subsystem knows its slice of the state resp. derivatives vector to get its specific state variables from and write the derivatives to. The computation of the derivatives can then use values communicated among the subsystems.
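As a language-agnostic sketch of that "one big state vector" idea (written in C++ purely for illustration; the subsystem equations are made-up placeholders, and the same structure carries over to a Python odeint right-hand side):
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Subsystem A owns u[0..1], subsystem B owns u[2..3]; each receives the
// coupling value it needs from the other subsystem's slice.
void subsystemA(const double* u, double* du, double /*t*/, double from_B) {
    du[0] = u[1];
    du[1] = -u[0] + 0.1 * from_B;       // driven by a state of B
}
void subsystemB(const double* u, double* du, double /*t*/, double from_A) {
    du[0] = u[1];
    du[1] = -2.0 * u[0] + 0.2 * from_A; // driven by a state of A
}

// The combined right-hand side the solver actually sees.
void full_rhs(const Vec& u, Vec& du, double t) {
    subsystemA(&u[0], &du[0], t, u[2]);
    subsystemB(&u[2], &du[2], t, u[0]);
}

int main() {
    Vec u = {1.0, 0.0, 0.0, 0.0}, du(4);
    full_rhs(u, du, 0.0);               // one evaluation of the coupled system
    for (double d : du) printf("%g ", d);
    printf("\n");
    return 0;
}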
What you are trying to do, solving the subsystems separately, works only for, resp. will likely result in, an order 1 integration method. All higher-order methods need to be able to simultaneously shift the state in some direction computed from previous stages of the method, and evaluate the whole system there.

Convolution vs signal resolution

I realized that the resolution of the input signal dramatically affects the results of the convolution. I'm wondering if there is a way to compensate somehow for this. Let me give you an example:
Let's take the Sérsic equation (shown as an image in the original post), with some example parameter values.
Now we solve this equation both for an R step of 0.1 and for 0.01. For example, for the 1st point (R = 0) we get \mu(0) = 9.82.
The next step is to convolve the data, after converting it into counts (to convert to counts we can use this simple equation: Data(R) = 10^((\mu(R)-25)/(-2.5))). I'm using the subroutine below, which I wrote, but I tried others and I get the same result (the PSF is a Moffat with FWHM = 0.5 arcsec, and it's constructed so that its total area equals 1):
sum1 = 0
DO i = 1,n
   sum1 = 0
   g = i
   DO f = 1,i
      sum1(f) = Data(f)*PSF(g)
      g = i - f
   ENDDO
   convData(i) = sum(sum1)
ENDDO
convData = convData(n:1:-1)
So, for this example, for the data with 0.1 resolution, after convolution (and after reconverting the counts to \mu) I get \mu(0)* = 13.52. For the data with 0.01 resolution I get \mu(0)* = 15.52. This is a 2 magnitude difference!! What am I doing wrong, or how can I somehow compensate for this effect?
Thank you so much for the help!

Fast pnorm() computation on really long vector (of length ~1e+7 to ~1e+8)

Is there a way to optimize pnorm? I am having a bottleneck in my code, and after a lot of optimization and benchmarking I realized that it comes from the call to pnorm on really big vectors.
With microbenchmarking I got on my machine that if length(u) ~ 5e+7 then pnorm(u) takes 11 seconds.
Is there a way to use Rcpp here, or is the built-in pnorm already optimized?
Any ideas welcomed.
I have found these posts on SO: Use pnorm from Rmath.h with Rcpp and How can I use qnorm on Rcpp?. But as far as I understood, their purpose is to use the R functions from C++ code.
In this answer, I am going to demonstrate a fast yet accurate approximation to pnorm().
Before we start, we need to be clear: what do we want to achieve by using approximation? Efficiency / speed / performance, right? But where would such efficiency come from?
As discussed above, pnorm() computation is memory-bound; even if we do approximate computation, it is still memory-bound (hence we don't consider further parallelization). Memory-bound problems have
number of floating point operations : memory accesses = O(1)
In other words, this ratio is some constant C. So our aim should be to reduce this constant, i.e., we want to reduce the number of floating point operations.
The number of floating point operations is often reported as the number of floating point additions and multiplications; other types of floating point operations are "converted" to this measure. Now, let's compare the costs of several common floating point operations.
u <- sample(1:10/10, 5e+7, replace = TRUE)
system.time(u + u)
# user system elapsed
# 0.468 0.168 0.639
system.time(u * u)
# user system elapsed
# 0.424 0.212 0.638
system.time(u / u)
# user system elapsed
# 0.504 0.204 0.710
system.time(u ^ 1.1)
# user system elapsed
# 7.240 0.212 7.458
system.time(sqrt(u))
# user system elapsed
# 2.044 0.176 2.224
system.time(exp(u))
# user system elapsed
# 4.336 0.208 4.550
system.time(log(u))
# user system elapsed
# 2.748 0.172 2.925
system.time(round(u))
# user system elapsed
# 6.836 0.188 7.034
Note that addition and multiplication are cheap, square root and logarithm are more expensive, while some operations are very expensive, including power, exponential and rounding.
Now let's get back to pnorm(), or even dnorm(), etc, where we have an exponential term to compute. Given that:
system.time(pnorm(u))
# user system elapsed
# 11.016 0.160 11.193
system.time(dnorm(u))
# user system elapsed
# 8.844 0.164 9.022
we see that the majority of the time to compute pnorm() and dnorm() is attributable to computing the exponential. pnorm() takes longer than dnorm() because it further involves an integral!
Now, our goal is fairly clear: we want to replace the expensive pnorm() evaluation with something really cheap, ideally only involving addition / multiplication. Can we??
There have been many approximation methods in history. @Ben has mentioned the logistic approximation; the R function plogis() does this. But a careful read of ?plogis shows that it is still based on exponentials.
Now, instead of using those parametric approximations, can we do a non-parametric approximation? We should not be doing regression here; instead, we want to use some interpolation function of fine-resolution accurate data to predict pnorm(). Linear interpolation is the best choice, as it is super efficient (due to sparsity: the linear predictor matrix is tri-diagonal). In R, approx does this. I refer readers unfamiliar with it to ?approx, and will simply proceed.
The OP says he only needs the standard normal distribution, so we focus on this. Consider the following approximation function (I did not use approxfun because I want a customizable h):
approx.pnorm <- function(u, h = 0.2) {
  x <- seq(from = -4, to = 4, by = h)
  approx(x, pnorm(x), yleft = 0, yright = 1, xout = u)$y
}
The accurate data are taken on a grid of resolution h over [-4, 4]. Predictions below -4 are 0, while predictions beyond 4 are 1. This satisfies the requirements of a CDF. Given new values u, we approximate pnorm(u) by linear interpolation based on the known accurate data.
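For readers who prefer compiled code, the same interpolation idea is easy to sketch in C++, with the standard library's erfc providing the exact table values (an illustration only, not the code used in this answer):
#include <cmath>
#include <cstdio>
#include <vector>

// Tabulate the exact standard normal CDF on a grid over [-4, 4] with spacing h,
// then answer queries by linear interpolation (0 below the grid, 1 above it).
struct PnormTable {
    double lo, h;
    std::vector<double> y;

    explicit PnormTable(double h_ = 0.2, double lo_ = -4.0, double hi = 4.0)
        : lo(lo_), h(h_) {
        int n = (int)std::floor((hi - lo_) / h_) + 1;
        y.resize(n);
        for (int i = 0; i < n; ++i) {
            double x = lo_ + i * h_;
            y[i] = 0.5 * std::erfc(-x / std::sqrt(2.0));  // exact Phi(x)
        }
    }

    double operator()(double x) const {
        if (x <= lo) return 0.0;
        double pos = (x - lo) / h;
        size_t i = (size_t)pos;
        if (i + 1 >= y.size()) return 1.0;
        double w = pos - i;                               // linear weight
        return (1.0 - w) * y[i] + w * y[i + 1];
    }
};

int main() {
    PnormTable approx_pnorm(0.2);
    printf("approx Phi(0)    = %.6f\n", approx_pnorm(0.0));
    printf("approx Phi(1.96) = %.6f\n", approx_pnorm(1.96));
    return 0;
}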
Obviously, the resolution h controls accuracy. Consider the following function to compute RMSE and display approximation curve:
RMSEh <- function(h) {
  x <- sort(rnorm(1000))
  y <- pnorm(x)
  y1 <- approx.pnorm(x, h)
  plot(x, y, type = "l", lwd = 2); lines(x, y1, col = 2, lwd = 2)
  mean((y - y1) ^ 2)^0.5
}
par(mfrow = c(1, 3))
RMSEh(1) # 0.01570339
RMSEh(0.5) # 0.003968882
RMSEh(0.2) # 0.000639888
Actually, when h = 0.2, the approximation is already fairly good. So we will use h = 0.2 in the following.
Benchmarking
This should be the most exciting part. Above we have seen that accurate computation of pnorm(u) takes 11 seconds. Now:
system.time(approx.pnorm(u, h = 0.2))
# user system elapsed
# 2.656 0.172 2.833
Wow, we are nearly 4 times faster!!
I am not here to disappoint you, but pnorm is already optimized. If you type "pnorm" in your R console, you see how it is written:
function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
.Call(C_pnorm, q, mean, sd, lower.tail, log.p)
<bytecode: 0x98712e0>
<environment: namespace:stats>
It is already written in C (see Rmath.h).
Some people might then suggest you do parallel computing. R-level parallelism can use, for example, the mclapply / parLapply / parSapply functions from the parallel package. But whether you should do this depends on what hardware you have.
It is a bad idea to parallelize pnorm() on a simple multi-core machine, as it is memory-bound. The ratio between CPU computation and memory references is O(1) (using big-O notation). Furthermore, R-level parallelism is not thread-level parallelism but works by setting up independent R processes. This means the parallel overhead is greater and not easily amortised.
If you have a cluster, you can do parallel computing on different nodes for really large problems. You will get good parallel scalability.
Further clarification on parallel processing
Assume u is a long vector: u[1], u[2], ..., and we aim to compute pnorm(u). Each element u[i] is only brought from RAM to the CPU once, without a second use. Therefore, computation of pnorm() requires a constant stream of data reads.
Now consider a multi-core machine with 4 physical CPUs (i.e., each with non-shared execution units, like registers, ALU, FPU, L1 cache, etc.). We set up 4 threads or processes hoping to run 4 parallel pnorm() computations on 4 different chunks of u. During the computation, every CPU is "data-hungry" and asking for a constant data flow. However, there is only a single bus. If one CPU is occupying the bus, the data flow for the other three is cut off, so they have nothing to do. In other words, those 4 CPUs can almost never work at the same time, and they are no better than a single CPU.
Now we move on to 4 nodes of a cluster. After the initial data split to the 4 different nodes, each node works in single-CPU mode. There are neither shared execution resources nor shared memory resources between the 4 nodes, so they can work completely in parallel. In the end, the results from the 4 nodes are merged together. In this way, good / reasonable scalability can be guaranteed for really large problems.
Parallel computing on a multi-core machine is only useful for CPU-bound tasks (up to some extent, before the bus becomes saturated). Specifically, we should use block algorithms for L1 caching. Caching achieves considerable data reuse. For example, for a block matrix multiplication with block size nb, the ratio between CPU work and memory reads is O(nb). After a CPU has read a block of data into its exclusive L1 cache, it does not require access to RAM for a comparatively long period of time (in CPU cycles), so the bus becomes free; the other cores can then use that gap to read the data they require. As long as only a limited number of CPUs are used, they can work in an interleaved manner without mutual interference.
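As an aside, here is a minimal C++ sketch of such a blocked (cache-tiled) matrix multiplication, only to illustrate the O(nb) reuse argument; the block size is a made-up tuning parameter:
#include <algorithm>
#include <vector>

// Blocked C += A * B for n x n row-major matrices. Each nb x nb tile of A and
// B is reused ~nb times once loaded, which is what raises the
// compute-to-memory-traffic ratio to O(nb).
void blocked_matmul(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int nb = 64) {
    for (int ii = 0; ii < n; ii += nb)
        for (int kk = 0; kk < n; kk += nb)
            for (int jj = 0; jj < n; jj += nb)
                // multiply one pair of tiles
                for (int i = ii; i < std::min(ii + nb, n); ++i)
                    for (int k = kk; k < std::min(kk + nb, n); ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + nb, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main() {
    const int n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    blocked_matmul(A, B, C, n);
    return C[0] == n ? 0 : 1;   // each entry should equal n for all-ones inputs
}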
I am a bit surprised that the linear interpolation shown in this answer is so slow. A solution is to use this package instead (created by Yixuan Qiu and updated by me) to do the interpolation. It can be installed by calling:
remotes::install_github("boennecd/fastncdf")
Updated version of old answer
What follows is my old answer, along with a new approximation using the fastncdf package.
Is there a way to use Rcpp here, or the built-in pnorm is already optimized ?
You can go with an approximation like in this answer. However, pnorm does seem to benefit from computation in parallel at least on my machine. Here is an example using OpenMP:
#include <Rcpp.h>
#include <Rmath.h>
#include <cmath>
// [[Rcpp::plugins(openmp)]]
#ifdef _OPENMP
#include <omp.h>
#endif
/**
 * evaluates the standard normal CDF after avoiding some checks in the
 * original version. Use with care!
 */
inline double pnorm_std(double const x, int lower, int is_log) {
  if(std::isinf(x) || std::isnan(x))
    return NAN;
  double p, cp;
  p = x;
  Rf_pnorm_both(x, &p, &cp, lower ? 0 : 1, is_log);
  return lower ? p : cp;
}

/** calls pnorm_std potentially in parallel. */
// [[Rcpp::export(rng = false)]]
Rcpp::NumericVector pnorm_std(Rcpp::NumericVector x,
                              unsigned const n_threads = 1){
  R_len_t const n = x.size();
  Rcpp::NumericVector out(n);
  double const * const xb = &x[0];
  double * const ob = &out[0];
#ifdef _OPENMP
#pragma omp parallel for num_threads(n_threads) schedule(static)
#endif
  for(R_len_t i = 0; i < n; ++i)
    *(ob + i) = pnorm_std(*(xb + i), 1L, 0L);
  return out;
}
Using Rcpp::sourceCpp on the above, we can now compare the computation times and the precision / check that we get the same results:
# simulate data
set.seed(1)
u <- rnorm(1e7)
# assign function to compare with from other answer
approx_pnorm <- function(u, h = 0.2) {
  x <- seq(from = -4, to = 4, by = h)
  approx(x, pnorm(x), yleft = 0, yright = 1, xout = u)$y
}
# check times and results. First using the new interpolation method
library(fastncdf)
system.time(lin_itr <- fastpnorm(u))
#R> user system elapsed
#R> 0.068 0.016 0.084
# w/ pre-allocated vector
dum <- rep(0., length(u))
system.time(fastpnorm_preallocated(u, p = dum))
all.equal(lin_itr, dum)
#R> user system elapsed
#R> 0.058 0.000 0.058
# then as in the original answer
system.time(truth <- pnorm(u))
#R> user system elapsed
#R> 0.368 0.008 0.376
system.time(mini_one <- pnorm_std(u, 1L))
#R> user system elapsed
#R> 0.265 0.016 0.281
system.time(mini_six <- pnorm_std(u, 4L))
#R> user system elapsed
#R> 0.265 0.024 0.092
system.time(other_ans <- approx_pnorm(u, h = 0.2))
#R> user system elapsed
#R> 0.272 0.004 0.277
# are the results identical?
all.equal(mini_one, truth)
#R> [1] TRUE
all.equal(other_ans, truth)
#R> [1] "Mean relative difference: 0.001062221"
all.equal(lin_itr, truth)
#R> [1] "Mean relative difference: 8.765925e-08"
# what about the times?
bench::mark(`R ` = pnorm (u),
`C++ (1 thread) ` = pnorm_std (u, 1L),
`C++ (2 threads) ` = pnorm_std (u, 2L),
`C++ (4 threads) ` = pnorm_std (u, 4L),
`C++ (6 threads) ` = pnorm_std (u, 6L),
`Other answer ` = approx_pnorm(u, h = 0.2),
`C++ interpolation` = fastpnorm (u),
min_time = 10, relative = TRUE, check = FALSE)
#R> # A tibble: 7 x 13
#R> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
#R> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#R> 1 R 5.23 5.01 1 1 1.33 21 6 7.84s <NULL> <Rprofmem[,3] [1 × 3]> <bch:tm [27]> <tibble [27 × 3]>
#R> 2 C++ (1 thread) 3.96 3.82 1.31 1 1.53 28 7 7.95s <NULL> <Rprofmem[,3] [1 × 3]> <bch:tm [35]> <tibble [35 × 3]>
#R> 3 C++ (2 threads) 2.18 2.10 2.39 1 2.06 54 10 8.43s <NULL> <Rprofmem[,3] [1 × 3]> <bch:tm [64]> <tibble [64 × 3]>
#R> 4 C++ (4 threads) 1.29 1.26 3.98 1 2.62 92 13 8.63s <NULL> <Rprofmem[,3] [1 × 3]> <bch:tm [105]> <tibble [105 × 3]>
#R> 5 C++ (6 threads) 1 1 5.01 1 3.48 114 17 8.49s <NULL> <Rprofmem[,3] [1 × 3]> <bch:tm [131]> <tibble [131 × 3]>
#R> 6 Other answer 3.86 3.76 1.33 1.00 1 31 5 8.68s <NULL> <Rprofmem[,3] [11 × 3]> <bch:tm [36]> <tibble [36 × 3]>
#R> 7 C++ interpolation 1.13 1.08 4.61 1 3.07 105 15 8.49s <NULL> <Rprofmem[,3] [1 × 3]> <bch:tm [120]> <tibble [120 × 3]>
There is a fivefold reduction in the computation time using six threads on my six-core machine. Clearly, though, it does not scale linearly. Moreover, the answer I cited earlier does not yield that big a reduction with 70 million variables, and it is not that precise (I hope that I did not make an error). The new C++ version from fastncdf is almost as fast as using six threads with the Rcpp solution.

Program Help - Solving for e(n)

I've been wrestling with this issue for a week and I just need some guidance on the math part of it. If I could just understand the math behind it I could piece together the functions to make it work. The assignment is;
Design and develop a C++ program for Calculating e(n) when delta <= 0.000001
e(n-1) = 1 + 1/1! + 1/2! + 1/3! + 1/4! + … + 1/(n-1)!
e(n) = 1 + 1/1! + 1/2! + 1/3! + 1/4! + … + 1/(n)!
delta = e(n) – e(n-1)
You do not have any input to the program. Your output should be something like this:
N = 2 e(1) = 2 e(2) = 2.5 delta = 0.5
N = 3 e(2) = 2.5 e(3) = 2.565 delta = 0.065
...
You must use recursive function calls.
My first issue is the math and the variables that would contain them.
the delta, e(n), and e(n-1) variables must be doubles
if e(n) = 1 + 1/1! = 2, then e(n-1) must equal 1, which means delta = 1 (that's my thinking anyway). I'm just not sure of the math behind the 0.5 delta the first time and the 0.065 in the second iteration.
Can someone point me in the right direction on this problem?
Thank you,
T
From the Wikipedia link, you can see that e is the limit of the series 1 + 1/1! + 1/2! + 1/3! + ... as the number of terms grows.
I will not explain the notion of limits here, but what this basically means is that, if we define a function e where e(n) = 1 + 1/1! + 1/2! + 1/3! + 1/4! + … + 1/(n)! (which is the function given in your problem), we are able to approximate the real value of the constant e.
The higher n is, the closer we get from e.
If you look closely at the function, you can see that each time we add a term which is smaller than the previous one: 1 >= 1/1! >= 1/2! >= .... >= 1/(n)!
That basically means that every time we increase n, we get closer to e, but we slow down on the way.
The real value of e is 2.71828...
In our first step e(1) = 1, we are 1.71828... too far from the real value
In the second step e(2) = 2, we are at 0.71828..., 1 distance closer
In the third step e(3) = 2.5, we are now at 0.21828..., 0.5 distance closer
As you can see, we are getting there, but the closer we get, the slower we move. Now let's say that at each step, we want to know how close we have moved compared to the previous value.
We then do simply e(n) - e(n-1). This is basically what the delta means.
At some point, we are moving so slowly that it no longer makes any sense to keep going. We are almost staying put. At this point, we decide that our approximation is close enough to e.
In your case, the problem defines the minimum progression speed as 0.000001.
Here is a solution:
delta = e(n) - e(n-1) = 1/n!
delta < 0.000001
n! > 1000000
n >= 10, as 10! = 3628800 (while 9! = 362880 is still below 1000000)
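Putting the two answers together, here is a hedged C++ sketch of the kind of recursive program the assignment seems to ask for (the names and output format are my own; recomputing e(n-1) each round is wasteful but keeps the sketch short):
#include <cstdio>

// e(n) = 1 + 1/1! + 1/2! + ... + 1/n!, written with recursive calls as the
// assignment requires.
double factorial(int n) {
    return n <= 1 ? 1.0 : n * factorial(n - 1);
}

double e(int n) {
    return n == 0 ? 1.0 : e(n - 1) + 1.0 / factorial(n);
}

int main() {
    const double limit = 0.000001;
    int n = 1;
    double delta;
    do {
        ++n;
        delta = e(n) - e(n - 1);        // equals 1/n!
        printf("N = %d e(%d) = %.9f e(%d) = %.9f delta = %.9f\n",
               n, n - 1, e(n - 1), n, e(n), delta);
    } while (delta > limit);            // stops once 1/n! < 0.000001, i.e. at n = 10
    return 0;
}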