I wrote the following function in Rcpp and compiled it from R.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::export]]
arma::colvec Demo(arma::mat n, int K){
arma::colvec N(K);
for(int j=0; j<K; ++j){
for(int i=0; i<(K-j); ++i){
N[j] += accu(n.submat(i,0,i,j));
}
}
return N;
}
/***R
K = 4
n = cbind(c(1008, 5112, 1026, 25, 0), 0, 0, 0, 0)
Demo(n,K)
for(i in 1:3){
print(Demo(n,K))
print(K)
print(n)
}
*/
However, something really weird happens when I run it inside a loop.
For example, if I have
> K = 4
> n
[,1] [,2] [,3] [,4] [,5]
[1,] 1008 0 0 0 0
[2,] 5112 0 0 0 0
[3,] 1026 0 0 0 0
[4,] 25 0 0 0 0
[5,] 0 0 0 0 0
Then if I run the algorithm Demo a single time I receive the correct result
> Demo(n,K)
[,1]
[1,] 7171
[2,] 7146
[3,] 6120
[4,] 1008
However, if I run it multiple times inside a loop, it starts to behave weirdly:
for(i in 1:3){
print(Demo(n,K))
print(K)
print(n)
}
[,1]
[1,] 7171
[2,] 7146
[3,] 6120
[4,] 1008
[1] 4
[,1] [,2] [,3] [,4] [,5]
[1,] 1008 0 0 0 0
[2,] 5112 0 0 0 0
[3,] 1026 0 0 0 0
[4,] 25 0 0 0 0
[5,] 0 0 0 0 0
[,1]
[1,] 14342
[2,] 14292
[3,] 12240
[4,] 2016
[1] 4
[,1] [,2] [,3] [,4] [,5]
[1,] 1008 0 0 0 0
[2,] 5112 0 0 0 0
[3,] 1026 0 0 0 0
[4,] 25 0 0 0 0
[5,] 0 0 0 0 0
[,1]
[1,] 21513
[2,] 21438
[3,] 18360
[4,] 3024
[1] 4
[,1] [,2] [,3] [,4] [,5]
[1,] 1008 0 0 0 0
[2,] 5112 0 0 0 0
[3,] 1026 0 0 0 0
[4,] 25 0 0 0 0
[5,] 0 0 0 0 0
In the first run, it computes it correctly, then in the second run it gives the correct output multiplied by 2, and in the third run, it gives the correct output multiplied by 3. But based on the algorithm steps, I do not see an obvious step that produces this kind of behavior.
The correct output should have been
for(i in 1:3){
print(Demo(n,K))
}
[,1]
[1,] 7171
[2,] 7146
[3,] 6120
[4,] 1008
[,1]
[1,] 7171
[2,] 7146
[3,] 6120
[4,] 1008
[,1]
[1,] 7171
[2,] 7146
[3,] 6120
[4,] 1008
You are incrementing N in place via +=.
Your function fails to ensure that N is initialized to zero. Rcpp tends to do that by default (which I think is prudent), but this can be suppressed for speed if you know what you are doing.
A minimally repaired version of your code (with the correct header, and a call to .fill(0)) follows.
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::export]]
arma::colvec Demo(arma::mat n, int K){
    arma::colvec N(K);
    N.fill(0);    // important, or construct as N(K, arma::fill::zeros)
    for(int j=0; j<K; ++j){
        for(int i=0; i<(K-j); ++i){
            N[j] += accu(n.submat(i,0,i,j));
        }
    }
    return N;
}
/***R
K = 4
n = cbind(c(1008, 5112, 1026, 25, 0), 0, 0, 0, 0)
Demo(n,K)
for(i in 1:3) {
print(Demo(n,K))
print(K)
print(n)
}
*/
You could also call .zeros() once the vector is constructed, or construct it from arma::zeros<arma::colvec>(K); there are a number of different ways to ensure your content is cleared before adding to it.
The shortest, after checking the documentation, may be arma::colvec N(K, arma::fill::zeros).
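As a sanity check on the expected values, the double loop can be mirrored in a few lines of plain Python with an explicitly zeroed accumulator. This is only an illustrative sketch: the name demo and the list-of-lists layout are assumptions made here, not part of the Rcpp code. Because N starts from zero on every call, repeated calls should keep returning the single-call result.

```python
def demo(n, K):
    """Mirror of the Rcpp loop, with the accumulator zeroed on every call."""
    N = [0] * K  # explicit zero initialisation, the step missing in the original
    for j in range(K):
        for i in range(K - j):
            # accu(n.submat(i, 0, i, j)) sums row i over columns 0..j
            N[j] += sum(n[i][: j + 1])
    return N


# Same 5x5 matrix as in the question: first column filled, the rest zero.
first_col = [1008, 5112, 1026, 25, 0]
n = [[v, 0, 0, 0, 0] for v in first_col]

for _ in range(3):
    print(demo(n, 4))  # [7171, 7146, 6120, 1008] each time
```

If the `N = [0] * K` line were replaced by storage that survives between calls, the totals would double and triple exactly as in the question.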
I'm working with pandas. I have two dataframes with different numbers of rows:
df
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
So there are some rows in df1 that are not in df. I want to add those rows to df and reset the index accordingly. Previously I was just removing the extra rows to keep the two dataframes the same length, but now I want to add an empty (zero-filled) row wherever a wave value from df1 isn't present in df.
The desired result should look like this,
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0 0 0 0 0
6 4508.28 0 0 0 0 0
7 4512.99 0 0 0 0 0
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
How can I get this?
IIUC, you can use DataFrame.loc to zero out the values of df1 where wave doesn't exist in df:
df1.loc[~df1.wave.isin(df.wave), 'num':] = 0
Then use DataFrame.combine_first to make sure that the values in df take precedence:
df_out = df.set_index('wave').combine_first(df1.set_index('wave')).reset_index()
print(df_out)
[out]
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5.0 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9.0 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9.0 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14.0 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0.0 0.00000 0.00000 0.00000 0.000000
6 4508.28 0.0 0.00000 0.00000 0.00000 0.000000
7 4512.99 0.0 0.00000 0.00000 0.00000 0.000000
8 5520.50 1.0 0.06148 0.12556 8.21685 5520.484742
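To make the precedence rule concrete, here is a small sketch of the same two steps without pandas, using plain dicts keyed by wave. The dict names and the trimmed-down data (only the num column, only four waves) are assumptions for illustration; this is not how pandas implements combine_first, just the logic it applies in this case, where df has no missing values.

```python
# Hypothetical, trimmed-down stand-ins for df and df1: wave -> num only.
df_rows = {4050.32: 3, 4208.98: 5, 5520.50: 1}
df1_rows = {4050.32: 3, 4208.98: 6, 4502.21: 9, 5520.50: 1}

# Step 1: zero out the df1 entries whose wave is missing from df.
zeroed = {w: (v if w in df_rows else 0) for w, v in df1_rows.items()}

# Step 2: take the union of the keys, ordered by wave; values from df win,
# and the zeroed df1 values only fill the gaps.
combined = {w: df_rows.get(w, zeroed.get(w))
            for w in sorted(set(df_rows) | set(zeroed))}

print(combined)  # {4050.32: 3, 4208.98: 5, 4502.21: 0, 5520.5: 1}
```

Note that 4208.98 keeps the value 5 from df rather than 6 from df1, while 4502.21, which exists only in df1, comes through as 0.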
I'm trying to compute z-scores for multiple variables within groups defined by two grouping variables.
Here's an example:
data = mtcars
The variables I want z-scores for:
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
Computing the z-score for one variable (this works):
mtcars %>%
group_by(am, vs) %>%
mutate(z_mpg = (mpg - mean(mpg)) / sd(mpg))
The problem is that I can't get dapply or lapply to work with the code above so that all of the variables in vars are processed and I get all the z-scores at once.
If you know another way to normalise the data (mean 0, SD 1) while taking the groups into account, that would help me too.
Thanks!
You can use mutate_at together with funs to define your z-score function. Inside funs, . stands for the column being mutated.
mtcars %>%
group_by(am, vs) %>%
mutate_at(.cols = vars, funs(z = (. - mean(.)) / sd(.)))
Source: local data frame [32 x 17]
Groups: am, vs [4]
mpg cyl disp hp drat wt qsec vs am gear carb mpg_z_ disp_z_ hp_z_ drat_z_ wt_z_ qsec_z_
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0.3118089 -0.4852978 -0.7168218 -0.1024154 -0.48795905 0.60787578
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0.3118089 -0.4852978 -0.7168218 -0.1024154 0.03595488 1.12105734
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 -1.1710339 0.9679756 0.5147599 -0.7890520 0.66286051 -0.09519147
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0.2659345 1.6870444 0.3753676 -1.0547191 0.05956492 -0.36000354
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1.3156017 0.0331832 -0.5745432 0.1266602 -0.86434641 -0.15281032
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 -1.0695190 1.0153670 0.1364973 -1.7435153 0.76407410 0.17268463
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 -0.2703291 0.0331832 1.5237884 0.3872184 -0.69514319 -1.62477907
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1.4799831 -0.5783405 -1.9177872 0.2582986 -0.01232378 0.02243925
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 0.8324905 -0.6984282 -0.3412433 0.7533708 -0.12734568 2.00294653
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 -0.6243679 -0.1529447 0.9964304 0.7533708 0.70656315 -1.13854778
# ... with 22 more rows
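The grouped z-score can be checked by hand for one group. The sketch below is plain Python rather than dplyr and uses the mpg values of the six mtcars rows with am = 1 and vs = 0; the variable names are illustrative assumptions. statistics.stdev is the sample standard deviation (divisor n - 1), which is also what R's sd() computes.

```python
from statistics import mean, stdev

# mpg for the six mtcars rows with am == 1 and vs == 0
# (Mazda RX4, Mazda RX4 Wag, Porsche 914-2, Ford Pantera L, Ferrari Dino, Maserati Bora)
mpg = [21.0, 21.0, 26.0, 15.8, 19.7, 15.0]

m = mean(mpg)
s = stdev(mpg)  # sample standard deviation (n - 1), like R's sd()
z = [(x - m) / s for x in mpg]

print(round(z[0], 4))  # 0.3118
```

The first value agrees with the 0.3118089 shown for row 1 in the mpg z-score column above, and the z-scores within the group have mean 0 and standard deviation 1.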
How do I convert my list output to a data frame? Below is a sample of the code and data.
import pandas as pd
import numpy as np
from datetime import datetime
dat=pd.read_csv()
dat.Date = dat.Date.apply(lambda d: datetime.strptime(d, "%d-%m-%Y"))
dat.index = dat.Date
dat = dat.drop(['Date'], axis=1)
################################################################
#Provide Input parameters
Decay=0.4
Decay_Dur=15 #(in days)
Return_Avg_Dur=15 #(in days)
################################################################
Weights=[]
Weights=[pow(i,((2*Decay)-1)) for i in range(1,Decay_Dur+1)] # Calculate Weights
Weights=Weights[::-1] #Reverse the order
fin_dat=[0]
for j in range(1,(dat.shape[0]-Decay_Dur)):
    Sum_Weighted_Index=0
    for i in range(j,Decay_Dur+j):
        temp=Weights[i-j]*dat.iat[i-1,2] #
        Sum_Weighted_Index+=temp
    fin_dat.append(Sum_Weighted_Index)
Date SPX Index Surprise Index S&P 500 Daily Return
19-07-2007 1553.08 -0.0563 0.0045
20-07-2007 1534.1 0 -0.0122
23-07-2007 1541.57 0 0.0049
24-07-2007 1511.04 0 -0.0198
25-07-2007 1518.09 0 0.0047
26-07-2007 1482.66 0 -0.0233
27-07-2007 1458.95 0 -0.016
30-07-2007 1473.91 0 0.0103
31-07-2007 1455.27 -0.0867 -0.0126
01-08-2007 1465.81 -0.1529 0.0072
02-08-2007 1472.2 0 0.0044
03-08-2007 1433.06 -0.0848 -0.0266
06-08-2007 1467.67 0 0.0242
07-08-2007 1476.71 0 0.0062
08-08-2007 1497.49 0 0.0141
09-08-2007 1453.09 0 -0.0296
10-08-2007 1453.64 0 0.0004
13-08-2007 1452.92 0.0138 -0.0005
14-08-2007 1426.54 0 -0.0182
15-08-2007 1406.7 0 -0.0139
16-08-2007 1411.27 -0.1289 0.0032
17-08-2007 1445.94 0 0.0246
20-08-2007 1445.55 0 -0.0003
21-08-2007 1447.12 0 0.0011
22-08-2007 1464.07 0 0.0117
23-08-2007 1462.5 0 -0.0011
24-08-2007 1479.37 0 0.0115
27-08-2007 1466.79 0 -0.0085
I tried your code and then wrote a new version using pandas functions.
Everything below is my working "notes", with the result at the end.
Please check whether the results are correct.
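# Note: written for Python 2 and an old pandas release; DataFrame.from_csv and
# pd.rolling_apply have since been removed (pd.read_csv and Series.rolling(...).apply replace them).
import pandas as pd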
#--- generate some data ---
#dates = pd.date_range( '01-01-2010', periods=30, freq='D' )
#values = range(0,30)
#dat = pd.DataFrame( {'Date':dates, 'val1':values, 'val2':values} )
#dat.index = dat.Date
#print dat
data = '''Date SPX Surprise S&P-500
19-07-2007 1553.08 -0.0563 0.0045
20-07-2007 1534.1 0 -0.0122
23-07-2007 1541.57 0 0.0049
24-07-2007 1511.04 0 -0.0198
25-07-2007 1518.09 0 0.0047
26-07-2007 1482.66 0 -0.0233
27-07-2007 1458.95 0 -0.016
30-07-2007 1473.91 0 0.0103
31-07-2007 1455.27 -0.0867 -0.0126
01-08-2007 1465.81 -0.1529 0.0072
02-08-2007 1472.2 0 0.0044
03-08-2007 1433.06 -0.0848 -0.0266
06-08-2007 1467.67 0 0.0242
07-08-2007 1476.71 0 0.0062
08-08-2007 1497.49 0 0.0141
09-08-2007 1453.09 0 -0.0296
10-08-2007 1453.64 0 0.0004
13-08-2007 1452.92 0.0138 -0.0005
14-08-2007 1426.54 0 -0.0182
15-08-2007 1406.7 0 -0.0139
16-08-2007 1411.27 -0.1289 0.0032
17-08-2007 1445.94 0 0.0246
20-08-2007 1445.55 0 -0.0003
21-08-2007 1447.12 0 0.0011
22-08-2007 1464.07 0 0.0117
23-08-2007 1462.5 0 -0.0011
24-08-2007 1479.37 0 0.0115
27-08-2007 1466.79 0 -0.0085'''
from StringIO import StringIO
dat = pd.DataFrame.from_csv( StringIO(data), sep='\s+')
#------------------------------------------
decay = 0.4
decay_dur = 15 # (in days)
return_avg_dur = 15 # (in days)
#--- old version ---
weights = [ pow(i,(2*decay)-1) for i in range(1,decay_dur+1) ] # Calculate Weights
weights = weights[::-1] #Reverse the order
#weights = [ pow(i,(2*decay)-1) for i in range(1,decay_dur+1) ][::-1]
#fin_dat=[0]
dat['old'] = 0.0
for j in range(1,(dat.shape[0]-decay_dur)):
    sum_weighted_index = 0
    for i in range(j,decay_dur+j):
        #sum_weighted_index += weights[i-j] * dat.iat[i-1,2] #
        sum_weighted_index += weights[i-j] * dat['S&P-500'].iat[i-1] #
    #fin_dat.append(sum_weighted_index)
    dat['old'].iat[j] = sum_weighted_index
    #print sum_weighted_index
#--- new version ---
#def sum_weighted_index(data):
#    result = 0
#    for w, d in zip(weights, data):
#        result += w * d
#    return result
def sum_weighted_index(data):
    return sum( w * d for w, d in zip(weights, data) )
dat['new'] = pd.rolling_apply(dat['S&P-500'], decay_dur, sum_weighted_index).shift(-decay_dur+2).fillna(0)
print dat
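Result (note that the day-first dates 01-08-2007 through 10-08-2007 were parsed month-first, so they appear as 2007-01-08 through 2007-10-08; the row order is unaffected):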
SPX Surprise S&P-500 old new
Date
2007-07-19 1553.08 -0.0563 0.0045 0.000000 0.000000
2007-07-20 1534.10 0.0000 -0.0122 -0.010550 -0.010550
2007-07-23 1541.57 0.0000 0.0049 -0.044731 -0.044731
2007-07-24 1511.04 0.0000 -0.0198 -0.034384 -0.034384
2007-07-25 1518.09 0.0000 0.0047 -0.036309 -0.036309
2007-07-26 1482.66 0.0000 -0.0233 -0.042091 -0.042091
2007-07-27 1458.95 0.0000 -0.0160 -0.055676 -0.055676
2007-07-30 1473.91 0.0000 0.0103 -0.035502 -0.035502
2007-07-31 1455.27 -0.0867 -0.0126 -0.000058 -0.000058
2007-01-08 1465.81 -0.1529 0.0072 -0.008301 -0.008301
2007-02-08 1472.20 0.0000 0.0044 -0.000615 -0.000615
2007-03-08 1433.06 -0.0848 -0.0266 0.006442 0.006442
2007-06-08 1467.67 0.0000 0.0242 0.001076 0.001076
2007-07-08 1476.71 0.0000 0.0062 0.000000 0.027115
2007-08-08 1497.49 0.0000 0.0141 0.000000 0.002560
2007-09-08 1453.09 0.0000 -0.0296 0.000000 0.000000
2007-10-08 1453.64 0.0000 0.0004 0.000000 0.000000
2007-08-13 1452.92 0.0138 -0.0005 0.000000 0.000000
2007-08-14 1426.54 0.0000 -0.0182 0.000000 0.000000
2007-08-15 1406.70 0.0000 -0.0139 0.000000 0.000000
2007-08-16 1411.27 -0.1289 0.0032 0.000000 0.000000
2007-08-17 1445.94 0.0000 0.0246 0.000000 0.000000
2007-08-20 1445.55 0.0000 -0.0003 0.000000 0.000000
2007-08-21 1447.12 0.0000 0.0011 0.000000 0.000000
2007-08-22 1464.07 0.0000 0.0117 0.000000 0.000000
2007-08-23 1462.50 0.0000 -0.0011 0.000000 0.000000
2007-08-24 1479.37 0.0000 0.0115 0.000000 0.000000
2007-08-27 1466.79 0.0000 -0.0085 0.000000 0.000000
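The weighted sum itself can be cross-checked without pandas. The following is a minimal sketch in plain Python, assuming only the first 16 daily returns from the sample data; the helper name weighted_sums is made up for illustration. It builds the same reversed weights and slides a window of decay_dur values over the returns.

```python
decay = 0.4
decay_dur = 15  # window length in days

# S&P 500 daily returns from the sample data, first 16 rows
returns = [0.0045, -0.0122, 0.0049, -0.0198, 0.0047, -0.0233, -0.016, 0.0103,
           -0.0126, 0.0072, 0.0044, -0.0266, 0.0242, 0.0062, 0.0141, -0.0296]

# weights i**(2*decay - 1) for i = 1..decay_dur, reversed so the newest day gets weight 1
weights = [i ** (2 * decay - 1) for i in range(1, decay_dur + 1)][::-1]


def weighted_sums(values, weights):
    """Weighted sum over each full window of len(weights) consecutive values."""
    k = len(weights)
    return [sum(w * v for w, v in zip(weights, values[start:start + k]))
            for start in range(len(values) - k + 1)]


sums = weighted_sums(returns, weights)
print([round(s, 6) for s in sums])
```

With 16 returns there are two full windows, and their sums should agree with the first two non-zero entries of the old and new columns above (-0.010550 and -0.044731) to the printed precision.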
I want to construct a data frame in an Rcpp function, but when I get it, it doesn't really look like a data frame. I've tried pushing vectors etc. but it leads to the same thing. Consider:
RcppExport SEXP makeDataFrame(SEXP in) {
Rcpp::DataFrame dfin(in);
Rcpp::DataFrame dfout;
for (int i=0;i<dfin.length();i++) {
dfout.push_back(dfin(i));
}
return dfout;
}
in R:
> .Call("makeDataFrame",mtcars,PACKAGE="myPkg")
[[1]]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
[[5]]
[1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11
[[6]]
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
[[7]]
[1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40
[13] 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41
[25] 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60
[[8]]
[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
[[10]]
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
[[11]]
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
Briefly:
DataFrames are indeed just like lists with the added restriction of having to have a common length, so they are best constructed column by column.
The best way is often to look at our unit tests. Here, inst/unitTests/runit.DataFrame.R groups the tests for the DataFrame class.
You also found the .push_back() member function in Rcpp, which we added for convenience and by analogy with the STL. We do warn that it is not recommended: due to differences in the way R objects are constructed, we essentially always need to make full copies, so .push_back() is not very efficient.
Despite me answering here frequently, the rcpp-devel list is a better place for Rcpp questions.
It seems Rcpp can return a proper data.frame, provided you supply the names explicitly. I'm not sure how to adapt this to your example with arbitrary names.
mkdf <- '
Rcpp::DataFrame dfin(input);
Rcpp::DataFrame dfout;
for (int i=0;i<dfin.length();i++) {
dfout.push_back(dfin(i));
}
return Rcpp::DataFrame::create( Named("x")= dfout(1), Named("y") = dfout(2));
'
library(inline)
test <- cxxfunction( signature(input="data.frame"),
mkdf, plugin="Rcpp")
test(input=head(iris))
Using the information from @baptiste's answer, this is what finally does give a well-formed data frame:
RcppExport SEXP makeDataFrame(SEXP in) {
Rcpp::DataFrame dfin(in);
Rcpp::DataFrame dfout;
Rcpp::CharacterVector namevec;
std::string namestem = "Column Heading ";
for (int i=0;i<2;i++) {
dfout.push_back(dfin(i));
namevec.push_back(namestem+std::string(1,(char)(((int)'a') + i)));
}
dfout.attr("names") = namevec;
Rcpp::DataFrame x;
Rcpp::Language call("as.data.frame",dfout);
x = call.eval();
return x;
}
I think the point remains that this might be inefficient because of push_back (as suggested by @Dirk) and the extra Language call evaluation. I looked through the Rcpp unit tests and haven't been able to come up with something better yet. Does anybody have any ideas?
Update:
Using @Dirk's suggestions (thanks!), this seems to be a simpler, more efficient solution:
RcppExport SEXP makeDataFrame(SEXP in) {
Rcpp::DataFrame dfin(in);
Rcpp::List myList(dfin.length());
Rcpp::CharacterVector namevec;
std::string namestem = "Column Heading ";
for (int i=0;i<dfin.length();i++) {
myList[i] = dfin(i); // adding vectors
namevec.push_back(namestem+std::string(1,(char)(((int)'a') + i))); // making up column names
}
myList.attr("names") = namevec;
Rcpp::DataFrame dfout(myList);
return dfout;
}
I concur with joran. The output of a C function called from within R is a list of all its arguments, both "in" and "out", so each "column" of the dataframe could be represented in the C function call as an argument. Once the result of the C function call is in R, all that remains to be done is to extract those list elements using list indexing and give them the appropriate names.