Compute multiple z-scores by two groups with dplyr mutate and dapply/lapply

I'm trying to compute z-scores for multiple variables by two groups.
Here's an example:
data = mtcars
The variables I want z-scores for:
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
Computing the z-score for one variable (working):
library(dplyr)
mtcars %>%
  group_by(am, vs) %>%
  mutate(z_mpg = (mpg - mean(mpg)) / sd(mpg))
The problem is that I can't get dapply or lapply to work with the code above so that all of the variables in vars are run through it and I get all the z-scores at once.
If you have an idea how to do this by normalising the data (mean 0, SD 1) while taking the groups into account, instead of z-scoring, that would also help me.
Thanks!

You can use mutate_at, with funs to define your z-score function. Inside funs, . stands for the column being mutated. (funs() is deprecated as of dplyr 0.8.0; in newer versions, list(z = ~ (. - mean(.)) / sd(.)) takes its place.)
mtcars %>%
  group_by(am, vs) %>%
  mutate_at(vars, funs(z = (. - mean(.)) / sd(.)))
Source: local data frame [32 x 17]
Groups: am, vs [4]
mpg cyl disp hp drat wt qsec vs am gear carb mpg_z_ disp_z_ hp_z_ drat_z_ wt_z_ qsec_z_
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0.3118089 -0.4852978 -0.7168218 -0.1024154 -0.48795905 0.60787578
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0.3118089 -0.4852978 -0.7168218 -0.1024154 0.03595488 1.12105734
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 -1.1710339 0.9679756 0.5147599 -0.7890520 0.66286051 -0.09519147
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0.2659345 1.6870444 0.3753676 -1.0547191 0.05956492 -0.36000354
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1.3156017 0.0331832 -0.5745432 0.1266602 -0.86434641 -0.15281032
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 -1.0695190 1.0153670 0.1364973 -1.7435153 0.76407410 0.17268463
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 -0.2703291 0.0331832 1.5237884 0.3872184 -0.69514319 -1.62477907
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1.4799831 -0.5783405 -1.9177872 0.2582986 -0.01232378 0.02243925
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 0.8324905 -0.6984282 -0.3412433 0.7533708 -0.12734568 2.00294653
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 -0.6243679 -0.1529447 0.9964304 0.7533708 0.70656315 -1.13854778
# ... with 22 more rows

How to plot shade red according to ratio variable using sgpanel plot

I would like to plot a dataset and obtain the desired output with the right setup:
1. Plot the scatter so that the points are shaded red, from light red to dark red, according to the ratio on a 0-1 scale (0 = light red, 1 = dark red).
2. Show a legend with the red colour scale for the ratio 0-1 (as in point 1).
Data explanation:
area - city (shortcut)
id - user id
var - variable
time - datetime
exit - consumer left
ratio - proportion (between 0-1)
Data sample and my plotting attempt (obviously not correct):
data data;
input area $ id $ var $ time $ exit $ ratio $;
datalines;
A 1 1 1 0 0.18
A 1 1 2 0 0.11
A 2 1 1 1 0.14
A 2 1 2 0 0.15
A 2 1 3 0 0.14
A 3 1 1 0 0.17
A 3 1 2 0 0.19
A 3 1 3 1 0.21
A 3 1 4 0 0.14
B 4 2 1 0 0.14
B 4 2 2 1 0.15
B 5 2 1 0 0.17
B 5 2 2 0 0.25
B 5 2 3 0 0.31
A 1 3 1 0 0.22
A 1 3 2 0 0.13
A 2 3 1 1 0.16
A 2 3 2 0 0.11
A 2 3 3 0 0.22
A 3 3 1 0 0.27
A 3 3 2 0 0.29
A 3 3 3 1 0.31
A 3 3 4 0 0.24
B 4 4 1 0 0.24
B 4 4 2 1 0.35
B 5 4 1 0 0.47
B 5 4 2 0 0.15
B 5 4 3 0 0.21
;;
run;
data attrs;
input id $ risk $ fillcolor $;
datalines;
ratio 0.05 Verylightred
ratio 0.15 Lightred
ratio 0.20 Red
ratio 0.25 Darkred
ratio 0.30 Verydarkred
ratio 0.35 Verydarkstrongred
;
run;
proc sgpanel data=data dattrmap=attrs;
panelby area exit;
scatter y=id x=var / markerattrs = (symbol = squarefilled) group=ratio attrid=ratio;
run;
This will get you closer.
Ratio should be numeric to be graphed.
Ratio is continuous, so how should it be used to group?
For the colour in the data attribute map, the length of the colour variable is not long enough, and risk should be numeric.
I don't know exactly how to specify the ranges you'd like for each colour, but this gets you closer using the automatic legend.
One way to get at this is to add a variable to the data set that identifies each group; you can then control the colour of each group with the data attribute map. This means adding a column called ratio_group to the 'data' data set which maps to the values in the data attribute map table. Use that variable as the group.
data data;
input area $ id $ var $ time $ exit $ ratio ;
datalines;
A 1 1 1 0 0.18
A 1 1 2 0 0.11
A 2 1 1 1 0.14
A 2 1 2 0 0.15
A 2 1 3 0 0.14
A 3 1 1 0 0.17
A 3 1 2 0 0.19
A 3 1 3 1 0.21
A 3 1 4 0 0.14
B 4 2 1 0 0.14
B 4 2 2 1 0.15
B 5 2 1 0 0.17
B 5 2 2 0 0.25
B 5 2 3 0 0.31
A 1 3 1 0 0.22
A 1 3 2 0 0.13
A 2 3 1 1 0.16
A 2 3 2 0 0.11
A 2 3 3 0 0.22
A 3 3 1 0 0.27
A 3 3 2 0 0.29
A 3 3 3 1 0.31
A 3 3 4 0 0.24
B 4 4 1 0 0.24
B 4 4 2 1 0.35
B 5 4 1 0 0.47
B 5 4 2 0 0.15
B 5 4 3 0 0.21
;;
run;
proc sgpanel data=data ;
panelby area exit;
scatter y=id x=var / markerattrs = (symbol = squarefilled size=10)
colorresponse=ratio
colormodel=(verylightred lightred red darkred verydarkred verydarkstrongred);
colaxis grid minorgrid;
rowaxis grid minorgrid;
run;
For marker size look at the SIZE option under the MARKERATTRS option.
For grids, look at the GRID/MINORGRID options under the COLAXIS and ROWAXIS statements.
COLAXIS documentation

Is there an R function to mutate a variable on multiple column conditions using if else or

I am trying to create a variable in my data based on the following conditions:
x y Z S T G
1 0 1 0 1 0
1 0 0 0 0 0
1 1 1 0 0 0
1 1 1 1 1 1
if x=1 then 1,
if y=1 then 2, if s=1 then 3,
if t=1 then 4, if G=1 then 5, if X==y==z==1 then 6, and so on.
Please tell me how I can write this using if else.
Using if else?
You can calculate it without if else:
v <- 1:6
# this vector gives each column its value:
# 1 2 3 ... 6
# the most tedious part is getting your notes into the R terminal
# as an R matrix.
# I used the fact that a string in R can span multiple lines:
s <- "x y Z S T G
1 0 1 0 1 0
1 0 0 0 0 0
1 1 1 0 0 0
1 1 1 1 1 1"
# it looks like this:
s
## [1] "x y Z S T G\n1 0 1 0 1 0\n1 0 0 0 0 0 \n1 1 1 0 0 0 \n1 1 1 1 1 1"
# after trying around for a long time with the base R functions,
# which led to errors and various problems, I found that the most elegant way
# to transform this string into a matrix-like tabular form
# is to use tidyverse's read_delim().
# install.packages("tidyverse")
# load tidyverse:
require(tidyverse) # or: library(tidyverse)
tb <- read_delim(s, delim=" ") ## it complains about parsing failures, but
tb
# A tibble: 4 x 6
x y Z S T G
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0
2 1 0 0 0 0 0
3 1 1 1 0 0 0
4 1 1 1 1 1 1
# so it is read in correctly!
# what you actually want to do is
# to multiply each row by `v` and sum the result:
tb[1, ]
# A tibble: 1 x 6
x y Z S T G
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 1 0
# you do:
v * tb[1, ]
x y Z S T G
1 1 0 3 0 5 0
# if you build sum with this, then you get your desired numbers
sum(v * tb[1, ])
## [1] 9
# row-wise manipulation of matrix/data.frame/tibbles you do by
apply(tb, MARGIN=1, FUN=function(row) v * row)
[,1] [,2] [,3] [,4]
x 1 1 1 1
y 0 0 2 2
Z 3 0 3 3
S 0 0 0 4
T 5 0 0 5
G 0 0 0 6
# very often such functions flip the results, so flip it back
# by the transpose function `t()`:
t(apply(tb, MARGIN=1, FUN=function(row) v * row))
x y Z S T G
[1,] 1 0 3 0 5 0
[2,] 1 0 0 0 0 0
[3,] 1 2 3 0 0 0
[4,] 1 2 3 4 5 6
# to get directly the sum by row, do:
apply(tb, MARGIN=1, FUN=function(row) sum(v * row))
## [1] 9 1 6 21
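# these are the values you wanted, aren't they?
# note that a plain `tb * v` does NOT do this: for a data frame, R recycles `v`
# column by column (as it would for a matrix), not along each row.
# transposing first lets R's vectorization do the row-wise product:
t(t(tb) * v)
x y Z S T G
[1,] 1 0 3 0 5 0
[2,] 1 0 0 0 0 0
[3,] 1 2 3 0 0 0
[4,] 1 2 3 4 5 6
# therefore the row sums are:
rowSums(t(t(tb) * v))
## [1] 9 1 6 21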
So this is the usual (messy) way one often gets to the solution.
In the end, it boils down to this (on Stack Overflow you usually find short answers like this one):
Short answer
require(tidyverse)
v <- 1:6
s <- "x y Z S T G
1 0 1 0 1 0
1 0 0 0 0 0
1 1 1 0 0 0
1 1 1 1 1 1"
tb <- read_delim(s, delim=" ")
rowSums(t(t(tb) * v))
And this is the beauty of R: if you know exactly what to do, it is just 1-3 lines of code (or a little more) ...

Incorrect reading of a variable from a txt (Fortran)

I'm trying to read this txt:
Fecha dia mes ano hora min
03/06/2016 00:00 3 6 2016 0 0
03/06/2016 00:05 3 6 2016 0 5
03/06/2016 00:10 3 6 2016 0 10
03/06/2016 00:15 3 6 2016 0 15
03/06/2016 00:20 3 6 2016 0 20
03/06/2016 00:25 3 6 2016 0 25
03/06/2016 00:30 3 6 2016 0 30
03/06/2016 00:35 3 6 2016 0 35
03/06/2016 00:40 3 6 2016 0 40
03/06/2016 00:45 3 6 2016 0 45
03/06/2016 00:50 3 6 2016 0 50
03/06/2016 00:55 3 6 2016 0 55
03/06/2016 01:00 3 6 2016 1 0
With the following code:
program fecha
implicit none
integer, dimension(13):: dia, mes, ano, hora, minuto
character*50 :: formato = '(11x,5x,1x,i1,1x,i1,1x,i4,1x,i1,1x,i2)'
open (unit = 10, file = 'datos.txt')
read(10,*)
read(unit = 10, fmt = formato) dia, mes, ano, hora, minuto
write(*,*) dia
close(10)
end program
Why does this code read 'dia' this way:
3 6 2016 0 0 3 6 2016 0 5 3 6 2016
(I know how it's reading, but not why.)
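In your program, dia, mes, ano, hora and minuto are whole arrays, so the single read statement fills all 13 elements of dia before it touches mes. Each record supplies five integers and the format is reused for every new record, which is why dia ends up holding the first values of consecutive lines. You need to skip two lines at the beginning and read the values line by line, one element of each array per record.
The following example is a slight modification of your program which runs smoothly.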
program fecha
implicit none
integer :: i, iounit
integer, parameter :: n = 13
integer, dimension(n) :: dia, mes, ano, hora, minuto
open (newunit = iounit, file = 'datos.txt')
read (iounit, *)
read (iounit, *)
do i = 1, n
read (unit = iounit, fmt = '(16x, i5, i4, i7, 2i5)') dia(i), mes(i), ano(i), hora(i), minuto(i)
print *, dia(i), mes(i), ano(i), hora(i), minuto(i)
end do
close (iounit)
end program
My output is
$ gfortran -g3 -Wall -fcheck=all a.f90 && ./a.out
3 6 2016 0 0
3 6 2016 0 5
3 6 2016 0 10
3 6 2016 0 15
3 6 2016 0 20
3 6 2016 0 25
3 6 2016 0 30
3 6 2016 0 35
3 6 2016 0 40
3 6 2016 0 45
3 6 2016 0 50
3 6 2016 0 55
3 6 2016 1 0

How to add dummy row based on one column in pandas dataframe?

I'm working with pandas. Basically, I have two dataframes with different numbers of rows:
df
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
So there are some rows in df1 that are not in df. I want to add those rows to df and reset the index accordingly. Previously I just removed the extra rows to keep the dataframes equal, but now I want to add an empty row wherever a wave value isn't there.
The desired result should look like this:
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0 0 0 0 0
6 4508.28 0 0 0 0 0
7 4512.99 0 0 0 0 0
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
How can I get this?
IIUC, you can use DataFrame.loc to update the values of df1 where wave doesn't exist in df:
df1.loc[~df1.wave.isin(df.wave), 'num':] = 0
Then use DataFrame.combine_first to make sure that the values in df take precedence:
df_out = df.set_index('wave').combine_first(df1.set_index('wave')).reset_index()
[out]
print(df_out)
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5.0 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9.0 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9.0 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14.0 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0.0 0.00000 0.00000 0.00000 0.000000
6 4508.28 0.0 0.00000 0.00000 0.00000 0.000000
7 4512.99 0.0 0.00000 0.00000 0.00000 0.000000
8 5520.50 1.0 0.06148 0.12556 8.21685 5520.484742
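For anyone who wants to try this without the full tables, here is a minimal, self-contained sketch of the same two steps on trimmed-down data. The reduced frames (three columns, a handful of rows) and the names filler, target and df_alt are illustrative assumptions, not part of the original answer; the reindex variant at the end is an alternative that the answer does not mention and is only shown for comparison.

```python
import pandas as pd

# Trimmed-down, illustrative versions of df and df1 (subset of the rows/columns above).
df = pd.DataFrame({
    "wave": [4050.32, 4398.01, 5520.50],
    "num": [3, 14, 1],
    "EWs": [22.16080, 52.83236, 8.21685],
})
df1 = pd.DataFrame({
    "wave": [4050.32, 4398.01, 4502.21, 5520.50],
    "num": [3, 15, 9, 1],
    "EWs": [22.91064, 45.78707, 60.67868, 8.21685],
})

# Step 1 (from the answer), applied to a copy so that df1 itself is left untouched:
# zero out every column from 'num' onwards for waves that are missing in df.
filler = df1.copy()
filler.loc[~filler["wave"].isin(df["wave"]), "num":] = 0

# Step 2 (from the answer): values from df take precedence, filler fills the gaps.
df_out = (
    df.set_index("wave")
      .combine_first(filler.set_index("wave"))
      .reset_index()
)

# Alternative (assumption, not from the answer): reindex df on df1's wave values.
target = pd.Index(df1["wave"], name="wave")
df_alt = df.set_index("wave").reindex(target, fill_value=0).reset_index()

print(df_out)
print(df_alt)
```

Both variants rely on the wave values being unique within each frame and on matching by exact floating-point equality, which holds here because both frames carry the same literals.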

constructing a Data Frame in Rcpp

I want to construct a data frame in an Rcpp function, but when I get it, it doesn't really look like a data frame. I've tried pushing vectors etc. but it leads to the same thing. Consider:
RcppExport SEXP makeDataFrame(SEXP in) {
Rcpp::DataFrame dfin(in);
Rcpp::DataFrame dfout;
for (int i=0;i<dfin.length();i++) {
dfout.push_back(dfin(i));
}
return dfout;
}
in R:
> .Call("makeDataFrame", mtcars, PACKAGE = "myPkg")
[[1]]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
[[5]]
[1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11
[[6]]
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
[[7]]
[1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40
[13] 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41
[25] 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60
[[8]]
[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
[[10]]
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
[[11]]
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
Briefly:
DataFrames are indeed just like lists, with the added restriction that all columns must have a common length, so they are best constructed column by column.
The best way is often to look at our unit tests. Here, inst/unitTests/runit.DataFrame.R
groups the tests for the DataFrame class.
You also found the .push_back() member function in Rcpp, which we added for convenience and by analogy with the STL. We do warn that it is not recommended: because of differences in the way R objects are constructed, we essentially always need to do full copies, so .push_back() is not very efficient.
Despite my answering here frequently, the rcpp-devel list is a better place for Rcpp questions.
It seems Rcpp can return a proper data.frame, provided you supply the names explicitly. I'm not sure how to adapt this to your example with arbitrary names.
mkdf <- '
Rcpp::DataFrame dfin(input);
Rcpp::DataFrame dfout;
for (int i=0;i<dfin.length();i++) {
dfout.push_back(dfin(i));
}
return Rcpp::DataFrame::create( Named("x")= dfout(1), Named("y") = dfout(2));
'
library(inline)
test <- cxxfunction( signature(input="data.frame"),
mkdf, plugin="Rcpp")
test(input=head(iris))
Using the information from @baptiste's answer, this is what finally gives a well-formed data frame:
RcppExport SEXP makeDataFrame(SEXP in) {
Rcpp::DataFrame dfin(in);
Rcpp::DataFrame dfout;
Rcpp::CharacterVector namevec;
std::string namestem = "Column Heading ";
for (int i=0;i<2;i++) {
dfout.push_back(dfin(i));
namevec.push_back(namestem+std::string(1,(char)(((int)'a') + i)));
}
dfout.attr("names") = namevec;
Rcpp::DataFrame x;
Rcpp::Language call("as.data.frame",dfout);
x = call.eval();
return x;
}
I think the point remains that this might be inefficient because of push_back (as suggested by @Dirk) and the additional Language call evaluation. I looked through the Rcpp unit tests and haven't been able to come up with something better yet. Does anybody have any ideas?
Update:
Using @Dirk's suggestions (thanks!), this seems to be a simpler, more efficient solution:
RcppExport SEXP makeDataFrame(SEXP in) {
Rcpp::DataFrame dfin(in);
Rcpp::List myList(dfin.length());
Rcpp::CharacterVector namevec;
std::string namestem = "Column Heading ";
for (int i=0;i<dfin.length();i++) {
myList[i] = dfin(i); // adding vectors
namevec.push_back(namestem+std::string(1,(char)(((int)'a') + i))); // making up column names
}
myList.attr("names") = namevec;
Rcpp::DataFrame dfout(myList);
return dfout;
}
I concur with joran. The output of a C function called from within R is a list of all its arguments, both "in" and "out", so each "column" of the dataframe could be represented in the C function call as an argument. Once the result of the C function call is in R, all that remains to be done is to extract those list elements using list indexing and give them the appropriate names.