Statistics on a list of data frames

I have a list of two data frames, one for control patients (CTRL) and one for sick patients (SICK). Each data frame contains the relative abundance of microbes from 3 patients:
List of 2
$ CTRL :'data.frame': 3 obs. of 18107 variables:
..$ Azorhizobium caulinodans : num [1:3] 1.48e-07 1.62e-06 1.05e-06
..$ Buchnera aphidicola : num [1:3] 9.63e-07 1.01e-06 8.09e-07
..$ Cellulomonas gilvus : num [1:3] 1.63e-06 5.39e-07 4.05e-07
..$ Dictyoglomus thermophilum : num [1:3] 2.30e-06 3.17e-06 1.34e-06
..$ Pelobacter carbinolicus : num [1:3] 9.63e-07 3.70e-06 1.38e-06
..$ Shewanella colwelliana : num [1:3] 9.63e-07 1.89e-06 1.62e-07
..$ Myxococcus fulvus : num [1:3] 1.78e-06 4.65e-06 1.50e-06
$ SICK:'data.frame': 3 obs. of 18107 variables:
..$ Azorhizobium caulinodans : num [1:3] 4.24e-07 0.00 1.28e-06
..$ Buchnera aphidicola : num [1:3] 5.45e-07 6.02e-07 4.47e-07
..$ Cellulomonas gilvus : num [1:3] 3.03e-07 0.00 2.23e-07
..$ Dictyoglomus thermophilum : num [1:3] 6.66e-07 2.75e-06 1.96e-06
..$ Pelobacter carbinolicus : num [1:3] 9.69e-07 1.72e-07 1.62e-06
..$ Shewanella colwelliana : num [1:3] 1.76e-06 6.02e-07 3.91e-07
..$ Myxococcus fulvus : num [1:3] 6.66e-07 8.60e-07 1.56e-06
I would like to compute a statistic for each taxon (CTRL vs SICK) and save the results for each bug in a separate data frame (results.mw). I tried:
results.mw = lapply(mylist, function(d, l)
{
# Run wilcoxon by column
as.data.frame(wilcox.test(d, l, exact = F)$p.value)
}, d$"CTRL", l$"SICK")
but I am getting an error
Error in FUN(X[[i]], ...) : unused argument (l$SICK)

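You need to loop through the taxa rather than over the original list that contains the two data frames. lapply() passes each list element as the first argument and appends the extra arguments after it, so your two-argument function ends up being called with three arguments, hence the "unused argument" error. Below I slightly edited the code so that it performs the pairwise test. I simulated data similar to what you have.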
# create data function
makeData = function(){
df = data.frame(matrix(rnorm(1000*3),3,1000))
colnames(df) = paste("S",1:1000,sep="_")
rownames(df) = letters[1:3]
return(df)
}
# create two data.frames
mylist = list(
CTRL=makeData(),SICK=makeData()
)
# check
str(mylist)
# although you said species are the same
# just to be sure
# we take intersection of species names
SPECIES = intersect(names(mylist$CTRL),names(mylist$SICK))
# loop through species
p = sapply(SPECIES, function(i)
{
# Run wilcoxon by species
wilcox.test(mylist$CTRL[,i],mylist$SICK[,i],exact=FALSE)$p.value
})
# gives you p-value by species
head(as.data.frame(p))
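If the goal is a results data frame rather than a bare vector, the loop above can be sketched as follows. This is only an illustration under assumptions: the toy values and taxon names (taxA, taxB) are made up, and the multiple-testing adjustment with p.adjust() is an addition that the question did not ask for.

```r
# Toy data standing in for the real abundance tables (values are made up)
mylist <- list(
  CTRL = data.frame(taxA = c(1.5, 1.6, 1.1), taxB = c(0.9, 1.0, 0.8)),
  SICK = data.frame(taxA = c(0.4, 0.2, 1.3), taxB = c(0.5, 0.6, 0.4))
)

# Only test taxa that are present in both groups
taxa <- intersect(names(mylist$CTRL), names(mylist$SICK))

# One Wilcoxon rank-sum p-value per taxon; vapply guarantees a numeric vector
p <- vapply(taxa, function(i) {
  wilcox.test(mylist$CTRL[[i]], mylist$SICK[[i]], exact = FALSE)$p.value
}, numeric(1))

# Collect everything in one data frame; the BH adjustment is an assumption
results.mw <- data.frame(
  taxon      = taxa,
  p.value    = unname(p),
  p.adjusted = p.adjust(unname(p), method = "BH"),
  stringsAsFactors = FALSE
)
print(results.mw)
```

With only three samples per group the test has very little power, so the p-values are best read as a rough screen.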

Reproduce list of data frame

Today I wanted to use the TropFishR package. The problem (for me) is that all data must be arranged in a list. So I tried to reconstruct the alba dataset so that I can repeat the process with my own data in the future. Here is what I have done:
library(TropFishR)
data("alba")
str(alba) #the list contain 4 variables
List of 4
$ sample.no : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
$ midLengths: num [1:14] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
$ dates : Date[1:7], format: "1976-04-17" "1976-07-02" "1976-09-19" ...
$ catch : num [1:14, 1:7] 0 0 0 1 1 1 3 9 5 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:7] "1976.29315068493" "1976.50136986301" "1976.71780821918" "1976.95616438356" ...
- attr(*, "class")= chr "lfq"
And this is what I did:
#1 We create sample.no
sample.no <- c(1:14)
sample.no
#2 We create "midlengths"
midlengths <- seq(from = 1.5, to = 14.5, by = 1)
midlengths
#3 We create "dates"
dates <- as.Date(c("1976-04-17","1976-07-02", "1976-09-19", "1976-12-15", "1977-02-18",
"1977-04-30", "1977-06-24"))
dates
#4 We create "catch"
catch <- as.matrix(read.csv(file.choose(), header=T))
#I copied the alba length freq data, move it to excel and imported as csv file
colnames(catch)<-NULL
print(catch)
#5 create list files
synLFQb <- list(sample.no,midlengths,dates,catch)
synLFQb #just checked if it turned out to be as desired format
#6 create a name for the data list
names(synLFQb) <- c("sample.no","midlengths","dates","catch")
#Finally, we need to assign the class lfq to our new object in order to allow it to be recognized by other TropFishR functions, e.g. plot.lfq:
class(synLFQb) <- "lfq"
it will produce "similar" data list
str(synLFQb)
List of 4
$ sample.no : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
$ midlengths: num [1:14] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
$ dates : Date[1:7], format: "1976-04-17" "1976-07-02" "1976-09-19" ...
$ catch : int [1:14, 1:7] 0 0 0 1 1 1 3 9 5 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : NULL
- attr(*, "class")= chr "lfq"
However, every time I tried to run this simple command:
plot(synLFQb, Fname="catch", hist.sc = 1)
It resulted in error:
> plot(synLFQb, Fname="catch", hist.sc = 1)
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In min(x, na.rm = na.rm) :
no non-missing arguments to min; returning Inf
2: In max(x, na.rm = na.rm) :
no non-missing arguments to max; returning -Inf
Any help will be much appreciated.
Please make sure that the mid-lengths vector in your list is named "midLengths", with a capital "L". I hope that does the trick in your example.
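A minimal sketch of that rename, assuming the list was built as in the question. The values below are shortened stand-ins, and the dates and catch components are left out for brevity.

```r
# Shortened stand-in for the list from the question (note the lowercase "l")
synLFQb <- list(sample.no = 1:3, midlengths = c(1.5, 2.5, 3.5))

# Rename only the offending component, leaving the others untouched
names(synLFQb)[names(synLFQb) == "midlengths"] <- "midLengths"

print(names(synLFQb))
```

Fixing the name in the original names(synLFQb) <- c(...) call works just as well.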

How can I return a list of matrices from Rcpp to R?

I have a function in Rcpp that does something like this: it creates a list of matrices of type std::list, and intends to return that list of matrices back to R.
I attach here a reduced example:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
Rcpp::List splitImagesRcpp(arma::mat x)
{
std::list<arma::mat> listOfRelevantImages;
int relevantSampleSize = x.n_rows;
for(int k = 0; k < relevantSampleSize; ++k)
{
listOfRelevantImages.push_back(x.row(k));
}
return wrap(listOfRelevantImages);
}
The problem here is, I want to return to R a list of matrices, but I get a list of vectors. I have been trying a lot and looking at the documentation, but I can't seem to find a solution for this. It looks like wrap is doing its job but it is also wrapping my matrices recursively inside of the list.
I get something like this:
> str(testingMatrix)
List of 200
$ : num [1:400] 1 1 1 1 1 1 1 1 1 1 ...
$ : num [1:400] 1 1 1 1 1 1 1 1 1 1 ...
But I want to get something like this:
> str(testingMatrix)
List of 200
$ : num [1:40, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
$ : num [1:40, 1:10] 1 1 1 1 1 1 1 1 1 1 ...
I want to do this from Rcpp, not in R. That is because I want to be able to interchange the function with a purely R programmed one, in order to measure the speedup.
Any help would be really appreciated!
Use the arma::field class, which has the necessary plumbing to convert to and from R and C++.
Here's some sample code showing how one would work with the field class, as your example above is not reproducible:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::field<arma::mat> splitImagesRcpp(arma::mat x) {
// Sample size
int relevantSampleSize = x.n_rows;
// Create a field class with a pre-set amount of elements
arma::field<arma::mat> listOfRelevantImages(relevantSampleSize);
for(int k = 0; k < relevantSampleSize; ++k)
{
listOfRelevantImages(k) = x.row(k);
}
return listOfRelevantImages;
}
Example:
set.seed(1572)
(x = matrix(runif(25), 5, 5))
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.2984725 0.679958392 0.5636401 0.9681282 0.25082559
# [2,] 0.3657812 0.157172256 0.6101798 0.5743112 0.62983179
# [3,] 0.6079879 0.419813382 0.5165553 0.3922179 0.64542093
# [4,] 0.4080833 0.888144280 0.5891880 0.6170115 0.13076836
# [5,] 0.8992992 0.002045309 0.3876262 0.9850514 0.03276458
(y = splitImagesRcpp(x))
# [,1]
# [1,] Numeric,5
# [2,] Numeric,5
# [3,] Numeric,5
# [4,] Numeric,5
# [5,] Numeric,5
y[[1]]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.2984725 0.6799584 0.5636401 0.9681282 0.2508256

Detecting number repetition in R using regex

Shouldn't this code work for repeating number detection in R?
> grep(pattern = "\\d{2}", x = 1223)
[1] 1
> grep(pattern = "\\d{3}", x = 1223)
[1] 1
If we have 988 we should get true and if 123 we should get false.
Sounds like it isn't.
> grep(pattern = "\\d{2}", x = "1223")
[1] 1
> grep(pattern = "\\d{2}", x = "13")
[1] 1
You need to use backreferences:
> grep(pattern = "(\\d)\\1", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1{1,}", x = "1224")
[1] 1
> grep(pattern = "(\\d)\\1", x = "1234")
integer(0)
EDIT: Here is how it works: (\\d) creates a capture group for \\d, which can be referred to with the backreference \\1. For example, if you have numbers of the form x2y and you want to find those where x is the same as y, then:
> grep(pattern = "(\\d)2\\1", x = "121")
[1] 1
> grep(pattern = "(\\d)2\\1", x = "124")
integer(0)
I'd strongly recommend that you read a basic tutorial on regular expressions.
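Since the question asks for true/false answers, the same backreference pattern can be used with grepl(), which returns a logical vector instead of indices. A small sketch, with made-up input values:

```r
x <- c("988", "123", "1223", "13")

# TRUE where some digit is immediately followed by the same digit
has_repeat <- grepl("(\\d)\\1", x)

print(setNames(has_repeat, x))
```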
I know the question explicitly says "using regex" in the title, but here is a non-regex method that could work, depending on what you want to do.
strings <- c("1223","1233","1234","113")
# detect consecutive repeat digits, or characters
(strings.rle <- lapply(strings, function(x)rle(unlist(strsplit(x,"")))))
[[1]]
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "1" "2" "3"
[[2]]
Run Length Encoding
lengths: int [1:3] 1 1 2
values : chr [1:3] "1" "2" "3"
[[3]]
Run Length Encoding
lengths: int [1:4] 1 1 1 1
values : chr [1:4] "1" "2" "3" "4"
[[4]]
Run Length Encoding
lengths: int [1:2] 2 1
values : chr [1:2] "1" "3"
Now you can work with strings.rle to do what you want
# which entries have consecutive repeat digits, or characters
strings[sapply(strings.rle, function(x) any(x$lengths > 1))]
[1] "1223" "1233" "113"
or
# which digits or characters are consecutively repeated?
lapply(strings.rle, function(x) x$values[which(x$lengths > 1)])
[[1]]
[1] "2"
[[2]]
[1] "3"
[[3]]
character(0)
[[4]]
[1] "1"

R : how to differentiate between inner and innermost brackets using regex

What I need from the string ((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS)))) is this:
"JJ", "RBJJ", "DTJJNNPNNPS", "JJCCRBJJ", "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
that is, find the text between the innermost brackets and delete the immediately surrounding brackets so that the text can be combined and extracted. But the string has several nesting levels, and the brackets can't all be removed at once because the number of brackets goes out of balance:
str1<-c()
str2<-c()
library(gsubfn)
strr<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
repeat {
str1<-unlist(strapply(strr, "((\\(([A-Z])+\\))+)"))
str2<-append(str1, str2)
strr<-gsub("(\\(\\w+\\))", "~\\1~", strr)
strr<-gsub("~\\(|\\)~", "", strr)
if (strr == "") {break}
}
strr
[1] "(VBD(JJCCRBJJINDTJJNNPNNPS"
There are brackets left that block the text from being combined, so it escapes the regex. The solution, I think, is to differentiate between the innermost brackets (JJ, RB, JJ, DT, JJ, NNP, NNPS; brackets 2, 4, 5, 7, 8, 9, 10 in the fresh string) and the inner brackets, so that when all the innermost brackets are uncovered step by step and the text is combined and extracted, we reach the whole string. Is there a regular expression to do this? Or is there any other way? Please help.
This doesn't use regular expressions. In fact, I'm not sure regular expressions are powerful enough to solve the problem; I suspect a parser is necessary. Rather than define a parser in R, I leverage the existing R code parser. Doing so relies on eval(parse(...)), which is potentially dangerous because it executes whatever code the string contains.
The basic idea is to turn the string into parsable code which generates a tree structure using lists. Then this structure is effectively reverse pruned (keeping only the leaf node inward), and the various strings at each level are created.
Some helper packages
library("plotrix")
library("plyr")
The original string that you gave
strr<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))")
Turn this string into parsable code, quoting what is inside the parentheses, and then making each set of parentheses a call to list. Commas have to be inserted between list items, but the innermost parts are always lists of length 1, so that isn't a problem. Then parse the code.
tmp <- gsub("\\(([^\\(\\)]*)\\)", '("\\1")', strr)
tmp <- gsub("\\(", "list(", tmp)
tmp <- gsub("\\)list", "),list", tmp)
tmp <- eval(parse(text=tmp))
At this point, tmp looks like
> str(tmp)
List of 3
$ :List of 1
..$ : chr "VBD"
$ :List of 3
..$ :List of 1
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
..$ :List of 1
.. ..$ : chr "CC"
..$ :List of 2
.. ..$ :List of 1
.. .. ..$ : chr "RB"
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
$ :List of 2
..$ :List of 1
.. ..$ : chr "IN"
..$ :List of 4
.. ..$ :List of 1
.. .. ..$ : chr "DT"
.. ..$ :List of 1
.. .. ..$ : chr "JJ"
.. ..$ :List of 1
.. .. ..$ : chr "NNP"
.. ..$ :List of 1
.. .. ..$ : chr "NNPS"
The nesting of parentheses is now nesting of lists. A few more helper functions are needed. The first collapses everything below a certain depth and throws away any node above that depth. The second is just a wrapper for paste to work on the elements of a list collectively.
atdepth <- function(l, d) {
if (d > 0 && !is.list(l)) {
return(NULL)
}
if (d == 0) {
return(unlist(l))
}
if (is.list(l)) {
llply(l, atdepth, d-1)
}
}
pastelist <- function(l) {paste(unlist(l), collapse="", sep="")}
Create a list where each element is the tree structure collapsed to a particular depth.
down <- llply(1:listDepth(tmp), atdepth, l=tmp)
Iterating backwards over this list, paste the leaf sets together. Work backwards "up" the (collapsed) trees. Doing this produces some blank strings (where there was a leaf higher up), so these are trimmed out.
out <- if (length(down) > 2) {
c(unlist(llply(length(down):3, function(i) {
unlist(do.call(llply, c(list(down[[i]]), replicate(i-3, llply), pastelist)))
})), unlist(pastelist(down[[2]])))
} else {
unlist(pastelist(down[[2]]))
}
out <- out[out != ""]
The result is what I think you asked for:
> out
[1] "JJ" "RBJJ"
[3] "DTJJNNPNNPS" "JJCCRBJJ"
[5] "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
> dput(out)
c("JJ", "RBJJ", "DTJJNNPNNPS", "JJCCRBJJ", "INDTJJNNPNNPS", "VBDJJCCRBJJINDTJJNNPNNPS"
)
EDIT:
In response to a comment with a subsequent question: How to adapt this to process over a set of these strings.
The general approach to solving the do-it-multiple-times-for-different-inputs is to create a function which takes a single item as input and returns the associated single output. Then loop over the function with one of the apply family of functions.
Pulling together all the code from earlier into a single function:
parsestrr <- function(strr) {
atdepth <- function(l, d) {
if (d > 0 && !is.list(l)) {
return(NULL)
}
if (d == 0) {
return(unlist(l))
}
if (is.list(l)) {
llply(l, atdepth, d-1)
}
}
pastelist <- function(l) {paste(unlist(l), collapse="", sep="")}
tmp <- gsub("\\(([^\\(\\)]*)\\)", '("\\1")', strr)
tmp <- gsub("\\(", "list(", tmp)
tmp <- gsub("\\)list", "),list", tmp)
tmp <- eval(parse(text=tmp))
down <- llply(1:listDepth(tmp), atdepth, l=tmp)
out <- if (length(down) > 2) {
c(unlist(llply(length(down):3, function(i) {
unlist(do.call(llply, c(list(down[[i]]), replicate(i-3, llply), pastelist)))
})), unlist(pastelist(down[[2]])))
} else {
unlist(pastelist(down[[2]]))
}
out[out != ""]
}
Now given a vector of strings to process, say:
strrs<-c("((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))",
"((VBD)(((JJ))(CC)((RB)(XX)(JJ)))((IN)(BB)((DT)(JJ)(NNP)(NNPS))))",
"((VBD)(((JJ)(QQ))(CC)((RB)(JJ)))((IN)((TQR)(JJ)(NNPS))))")
You can process all of them with
llply(strrs, parsestrr)
which returns
[[1]]
[1] "JJ" "RBJJ"
[3] "DTJJNNPNNPS" "JJCCRBJJ"
[5] "INDTJJNNPNNPS" "VBDJJCCRBJJINDTJJNNPNNPS"
[[2]]
[1] "JJ" "RBXXJJ"
[3] "DTJJNNPNNPS" "JJCCRBXXJJ"
[5] "INBBDTJJNNPNNPS" "VBDJJCCRBXXJJINBBDTJJNNPNNPS"
[[3]]
[1] "JJQQ" "RBJJ"
[3] "TQRJJNNPS" "JJQQCCRBJJ"
[5] "INTQRJJNNPS" "VBDJJQQCCRBJJINTQRJJNNPS"
I'm not sure whether you just want to build a tree structure of balanced text,
or why you want to strip the enclosing parentheses at the innermost level.
Using your example, if it is to be done in stages, the innermost level has to be determined first. Then the parentheses are stripped off the subsequent levels in recursive passes.
This of course requires a way to match balanced text. Some regex engines can do this.
If the engine you are using doesn't support this, it would have to be done manually via text processing.
I happen to have a regex analysis program. I fed your initial string into it and it formatted the string visually by group level. On each pass, I just stripped the inner parentheses, which simulates a recursion.
Maybe this can help you visualize what needs to be done.
## Pass 0
## ---------
(
( VBD )
(
(
( JJ )
)
( CC )
(
( RB )
( JJ )
)
)
(
( IN )
(
( DT )
( JJ )
( NNP )
( NNPS )
)
)
)
## Pass 1
## ---------
(
( VBD )
(
( JJ )
( CC )
( RB JJ )
)
(
( IN )
( DT JJ NNP NNPS )
)
)
## Pass 2
## ---------
(
( VBD )
( JJ CC RB JJ )
( IN DT JJ NNP NNPS )
)
## Pass 3
## ---------
( VBD JJ CC RB JJ IN DT JJ NNP NNPS )
## Pass 4
## ---------
VBD JJ CC RB JJ IN DT JJ NNP NNPS
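The manual text processing mentioned above can also be sketched as a single left-to-right scan with a stack, which avoids both recursive regexes and eval(parse()). This is an illustration rather than anyone's posted solution: collapse_groups is a hypothetical helper, and it assumes the wanted strings are exactly the bracket groups that contain at least one nested group.

```r
# Hypothetical helper: one pass over the characters, tracking open brackets
collapse_groups <- function(s) {
  chars <- strsplit(s, "")[[1]]
  starts <- integer(0)     # stack of positions of unmatched "("
  has_child <- logical(0)  # stack: does the open group contain another group?
  out <- character(0)
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == "(") {
      # The enclosing group (if any) now has a nested group
      if (length(has_child) > 0) has_child[length(has_child)] <- TRUE
      starts <- c(starts, i)
      has_child <- c(has_child, FALSE)
    } else if (ch == ")") {
      n <- length(starts)
      if (n == 0) stop("unbalanced ')' at position ", i)
      if (has_child[n]) {
        # Emit the group's text with all brackets removed
        inner <- paste(chars[starts[n]:i], collapse = "")
        out <- c(out, gsub("[()]", "", inner))
      }
      starts <- starts[-n]
      has_child <- has_child[-n]
    }
  }
  if (length(starts) > 0) stop("unbalanced '('")
  out
}

strr <- "((VBD)(((JJ))(CC)((RB)(JJ)))((IN)((DT)(JJ)(NNP)(NNPS))))"
res <- collapse_groups(strr)
print(res)
```

The groups come out in the order their closing brackets appear, so the six strings from the question are returned in a slightly different order than listed there.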
You don't really need to think of matching brackets here... Sounds like you just want to recursively match the pattern [()]([^()]*)[()].
That is, "match something containing no ( ) and delimited by ( or )"

which list element is being processed when using snowfall::sfLapply?

Assume we have a list (mylist) that is used as the input object for an lapply call. Is there a way to know which element of mylist is being evaluated? The method should work with lapply and snowfall::sfApply (and possibly other apply-family members) as well.
On chat, Gavin Simpson suggested the following method. This works great for lapply but not so much for sfApply. I would like to avoid extra packages or fiddling with the list. Any suggestions?
mylist <- list(a = 1:10, b = 1:10)
foo <- function(x) {
deparse(substitute(x))
}
bar <- lapply(mylist, FUN = foo)
> bar
$a
[1] "X[[1L]]"
$b
[1] "X[[2L]]"
This is the parallel version that isn't cutting it.
library(snowfall)
sfInit(parallel = TRUE, cpus = 2, type = "SOCK") # I use 2 cores
sfExport("foo", "mylist")
bar.para <- sfLapply(x = mylist, fun = foo)
> bar.para
$a
[1] "X[[1L]]"
$b
[1] "X[[1L]]"
sfStop()
I think you are going to have to use Shane's solution/suggestion in that chat session. Store your objects in a list such that each component of the top list contains a component with the name or ID or experiment contained in that list component, plus a component containing the object you want to process:
obj <- list(list(ID = 1, obj = 1:10), list(ID = 2, obj = 1:10),
list(ID = 3, obj = 1:10), list(ID = 4, obj = 1:10),
list(ID = 5, obj = 1:10))
So we have the following structure:
> str(obj)
List of 5
$ :List of 2
..$ ID : num 1
..$ obj: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ :List of 2
..$ ID : num 2
..$ obj: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ :List of 2
..$ ID : num 3
..$ obj: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ :List of 2
..$ ID : num 4
..$ obj: int [1:10] 1 2 3 4 5 6 7 8 9 10
$ :List of 2
..$ ID : num 5
..$ obj: int [1:10] 1 2 3 4 5 6 7 8 9 10
Then have something like the first line in the following function, followed by your own processing code:
foo <- function(x) {
writeLines(paste("Processing Component:", x$ID))
sum(x$obj)
}
Which will do this:
> res <- lapply(obj, foo)
Processing Component: 1
Processing Component: 2
Processing Component: 3
Processing Component: 4
Processing Component: 5
Which might work on snowfall.
I could also alter the attributes like so.
mylist <- list(a = 1:10, b = 1:10)
attr(mylist[[1]], "seq") <- 1
attr(mylist[[2]], "seq") <- 2
foo <- function(x) {
writeLines(paste("Processing Component:", attributes(x)))
}
bar <- lapply(mylist, FUN = foo)
(and the parallel version)
mylist <- list(a = 1:10, b = 1:10)
attr(mylist[[1]], "seq") <- 1
attr(mylist[[2]], "seq") <- 2
foo <- function(x) {
x <- paste("Processing Component:", attributes(x))
}
sfExport("mylist", "foo")
bar <- sfLapply(mylist, fun = foo)
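Another option that leaves the list itself untouched is to iterate over the indices instead of the elements, so the function always knows which position it is working on. The sketch below uses base lapply only; the two-argument form of foo is an assumption, not part of the code above.

```r
mylist <- list(a = 1:10, b = 1:10)

# foo receives the index i and looks the element up itself
foo <- function(i, data) {
  message("Processing component ", i, " (", names(data)[i], ")")
  sum(data[[i]])
}

bar <- lapply(seq_along(mylist), foo, data = mylist)
names(bar) <- names(mylist)
str(bar)
```

The same call shape should carry over to sfLapply(seq_along(mylist), foo, data = mylist), though that is untested here, and messages emitted on worker processes may not show up in the master console.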
