How to include variable values in regular expressions in R

I have 5 files which contain metabolites (details of different bacteria models). I'm writing a function to append a specified number of files. File names look like the following.
[1] "01_iAPECO1_1312_metabolites.csv" "02_iB21_1397_metabolites.csv"
[3] "03_iBWG_1329_metabolites.csv" "04_ic_1306_metabolites.csv"
[5] "05_iE2348C_1286_metabolites.csv"
Below is my function.
start = 3 # defines the starting position of the range
end = 5 # defines the ending position of the range
type = "metabolites" # two types of files - for metabolites and reactions
files <- NULL
if (type == "metabolites"){
files <- list.files(pattern = "*metabolites\\.csv$")
}else if(type == "reactions"){
files <- list.files(pattern = "*reactions\\.csv$")
}
#reading each file within the range and append them to create one file
for (i in start:end){
temp_df <- data.frame(ModelName = character(), Object = character(),stringsAsFactors = F)
#reading the current file
temp = rbind(one,temp_df)
}
#writing the appended file
write.csv(temp,"appended.csv",row.names = F,quote = F)
temp_df <- NULL
For example, with start = 3 and end = 5, the code should read files 03, 04 and 05 and append them. Note: the two digits at the beginning of each file name identify the file within the range. I can't work out how to select the required file inside the for loop with a regular expression. It works when I hard-code the number, as below, but I'm looking for a generalized version that uses i.
currentFile = grep("01.+",files)
Any help is appreciated.

For the test data shown below, this returns a vector containing the names of the files that start with 02, 03, 04 or 05 and end with "reactions.csv":
# create some empty test files
for(i in 1:5) cat(file = sprintf("%02djunkreactions.csv", i))
# test input
start <- 2
end <- 5
type <- "reactions"
list.files(pattern = paste(sprintf("^%02d.*%s[.]csv$", start:end, type), collapse = "|"))
giving:
[1] "02junkreactions.csv" "03junkreactions.csv" "04junkreactions.csv"
[4] "05junkreactions.csv"
Note: If start and end are both always one digit then a simplification is possible:
list.files(pattern = sprintf("^0[%d-%d].*%s[.]csv$", start, end, type))
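The same pattern can be wrapped into a function that also reads and appends the selected files. This is only a minimal sketch: the helper name append_files, its dir argument, and the demo file names are assumptions made for illustration, and it presumes every file has the same columns.

```r
# Hypothetical helper: select files numbered start..end of the given type,
# read each one, and stack them into a single data frame.
append_files <- function(start, end, type = "metabolites", dir = ".") {
  # One alternative per file number, e.g. "^03.*metabolites[.]csv$|^04..."
  pattern <- paste(sprintf("^%02d.*%s[.]csv$", start:end, type), collapse = "|")
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

# Demo on throwaway files in a temporary directory
demo_dir <- file.path(tempdir(), "append_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:5) {
  write.csv(data.frame(ModelName = sprintf("model%02d", i), Object = "x"),
            file.path(demo_dir, sprintf("%02d_demo_metabolites.csv", i)),
            row.names = FALSE)
}

combined <- append_files(3, 5, "metabolites", dir = demo_dir)
print(combined)
```

Building one alternation from start:end avoids the loop entirely, so no per-iteration grep on i is needed.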

You can do this with a cross-join.
library(dplyr)
library(stringi)
start = 3
end = 5
type = "metabolites"
all_files = tibble(file = list.files() )
desired_files = tibble(
number = start:end,
regex = sprintf("^%02d.*%s", number, type) )
all_files %>%
merge(desired_files) %>%
filter(stri_detect_regex(file, regex)) %>%
group_by(number) %>%
do(read.csv(.$file) ) %>%
write.csv("appended.csv", row.names = F, quote = F)

Are you looking for something like this?
files <- c("01_iAPECO1_1312_metabolites.csv", "02_iB21_1397_metabolites.csv","03_iBWG_1329_metabolites.csv", "04_ic_1306_metabolites.csv","05_iE2348C_1286_metabolites.csv")
for(i in 2:4) print(grep(sprintf("^(%02d){1}_",i),files,value=T))


How to create a crosstab with variable labels for PDF output in R markdown

I would like to make a table in R markdown that prints a crosstabulation of two variables and includes the variable name above it and on the left side. Also, I need to print this to a PDF so I require code that is compatible with kable("latex").
Reproducible example:
library(knitr)      # kable()
library(kableExtra) # pack_rows(), add_header_above(), %>%
set.seed(143)
x <- sample(x = c("yes", "no"), size = 20, replace = TRUE)
y <- sample(x = c("yes", "no"), size = 20, replace = TRUE)
table(x,y) %>%
kable("latex") %>%
pack_rows("X", 1, 2) %>%
add_header_above(c(" ", "Y" = 2))
Which gives the following output:
However I would like it to look like this (created in Word for example):

str_detect removing some but not all strings with specified ending

I'd like to remove any string that ends in either of 2 characters in a pipe. In this example it's ".o" or ".t". Some of them get removed, but not all of them, and I can't figure out why. I suspect something is wrong in the 'pattern = ' argument.
ex1 <- structure(list(variables = structure(1:18, .Label = c("canopy15",
"canopy16", "DistanceToRoad", "DistanceToEdge", "EdgeDistance",
"TrailDistance", "CARCOR.o", "EUOALA.o", "FAGGRA.o", "LINBEN.o",
"MALSP..o", "PRUSER.o", "ROSMUL.o", "RUBPHO.o", "VIBDEN.o", "ACERUB.t",
"FAGGRA.t", "NYSSYL.t"), class = "factor")), row.names = c(NA,
-18L), class = "data.frame")
ex1 %>%
dplyr::filter(stringr::str_detect(string = variables,
pattern = c("\\.o$", "\\.t$"),
negate = TRUE))
##output
# variables
# 1 canopy15
# 2 canopy16
# 3 DistanceToRoad
# 4 DistanceToEdge
# 5 EdgeDistance
# 6 TrailDistance
# 7 EUOALA.o
# 8 LINBEN.o
# 9 PRUSER.o
# 10 RUBPHO.o
# 11 FAGGRA.t
The pattern has two elements, so it is recycled along the rows: \\.o$ is checked against the first row, \\.t$ against the second, \\.o$ against the third, and so on. Combine both endings into a single pattern instead:
ex1 %>%
dplyr::filter(stringr::str_detect(string = variables,
pattern = c("\\.(o|t)$"),
negate = TRUE))
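To make the recycling visible, here is a small base-R sketch that emulates the element-wise pairing of strings and patterns with mapply() (this is an emulation for illustration, not a call into stringr itself; the vector x is a shortened, made-up subset of the data):

```r
x <- c("CARCOR.o", "EUOALA.o", "FAGGRA.o", "ACERUB.t")
patterns <- c("\\.o$", "\\.t$")

# Pair string i with pattern i, recycling the two patterns along x
paired <- mapply(grepl, rep_len(patterns, length(x)), x, USE.NAMES = FALSE)
print(paired)
# [1]  TRUE FALSE  TRUE  TRUE   ("EUOALA.o" was only tested against \\.t$)

# A single combined pattern tests every string against both endings
combined <- grepl("\\.(o|t)$", x)
print(combined)
# [1] TRUE TRUE TRUE TRUE
```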
For those not as well-versed in regular expressions, here is a simpler answer.
library(tidyverse)
# escape the dot so that it matches a literal "." rather than any character
ex1 %>% filter(str_detect(string = variables, pattern = "\\.t$", negate = TRUE),
str_detect(string = variables, pattern = "\\.o$", negate = TRUE))

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities in R through the Rcpp package. The actual dataframe contains about 2 million rows, and my current approach is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z, which is a categorical variable with levels a, b and c. I did a merge with another dataframe to bring the columns a, b and c, with their specific numbers, into df.
First step would be to match each row in z with the column names (a,b or c) and create a new column called 'type' and copy the corresponding number.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6, because df$type[1] = 303 matches the column cost$`303`, whose value is 6.
Final Dataframe should look like this - Created a sample output
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works
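Before reaching for Rcpp, a fully vectorised base-R lookup may already be fast enough for a couple of million rows. The following is a minimal sketch under two assumptions: cost is a single-row data frame, and every value of z names a column of df. The names vals, idx, cost_vec and out are illustrative only.

```r
df <- data.frame(z = c("a", "b", "c"),
                 a = c(303, 403, 503),
                 b = c(203, 103, 803),
                 c = c(903, 803, 703))
# check.names = FALSE keeps the numeric column names as "103", "203", ...
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8,
                   "603" = 9, "703" = 10, "803" = 11, "903" = 12,
                   check.names = FALSE)

# Step 1: for each row, pick the value from the column named in z
vals <- as.matrix(df[c("a", "b", "c")])
idx  <- cbind(seq_len(nrow(df)), match(df$z, colnames(vals)))
type <- vals[idx]

# Step 2: look each type up in a named vector of costs
cost_vec <- unlist(cost[1, ])
out <- data.frame(z = df$z,
                  type = type,
                  cost = unname(cost_vec[as.character(type)]))
print(out)
```

Matrix indexing with a two-column (row, column) index avoids any explicit loop over rows.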

Aggregate modis list files by month

I am looking for a more efficient way of splitting each year of the time series (2002-2016) by month. I've done it by hand, but it takes a long time.
mypath<-"D:/SNOWL"
myras<-list.files(path=mypath,pattern = glob2rx("*.tif"),
full.names = TRUE, recursive = TRUE)
> myras
[1] "D:/SNOWL/MOYDSL10A1.A2002001.tif" "D:/SNOWL/MOYDSL10A1.A2002002.tif"
[3] "D:/SNOWL/MOYDSL10A1.A2002003.tif" "D:/SNOWL/MOYDSL10A1.A2002004.tif"
[5] "D:/SNOWL/MOYDSL10A1.A2002005.tif" "D:/SNOWL/MOYDSL10A1.A2002006.tif"
[7] "D:/SNOWL/MOYDSL10A1.A2002007.tif" "D:/SNOWL/MOYDSL10A1.A2002008.tif"
[9] "D:/SNOWL/MOYDSL10A1.A2002009.tif" "D:/SNOWL/MOYDSL10A1.A2002010.tif"
[11] "D:/SNOWL/MOYDSL10A1.A2002011.tif" "D:/SNOWL/MOYDSL10A1.A2002012.tif"
serie<-orgTime(myras, nDays = "asIn", begin ="2002-01-01",end = "2016-12-31", pillow = 75, pos1 = 13, pos2 = 19)
filter<-serie$inputLayerDates
> filter
[1] "2002-01-01" "2002-01-02" "2002-01-03" "2002-01-04" "2002-01-05"
[6] "2002-01-06" "2002-01-07" "2002-01-08" "2002-01-09" "2002-01-10"
[11] "2002-01-11" "2002-01-12" "2002-01-13" "2002-01-14" "2002-01-15"
[16] "2002-01-16" "2002-01-17" "2002-01-18" "2002-01-19" "2002-01-20"
[21] "2002-01-21" "2002-01-22" "2002-01-23" "2002-01-24" "2002-01-25"
[26] "2002-01-26" "2002-01-27" "2002-01-28" "2002-01-29" "2002-01-30"
[31] "2002-01-31" "2002-02-01" "2002-02-02" "2002-02-03" "2002-02-04"
[36] "2002-02-05" "2002-02-07" "2002-02-08" "2002-02-09" "2002-02-10"
[41] "2002-02-11" "2002-02-12" "2002-02-13" "2002-02-14" "2002-02-15"
EDIT:
Ok, let's try a full size example and see if it's working for you:
# Here we generate filenames as returned from `list.files`:
myras <- sapply(1:5465, function(i) sprintf('D:/SNOWL/MOYDSL10A1.A%d%03d.tif', sample(2000:2016, 1), sample(1:365, 1)))
head(myras)
# Let's extract the timestamps
tstmps <- regmatches(myras,regexpr('[[:digit:]]{7}',myras))
head(tstmps,50)
# And now convert the timestamps to dates
dates <- as.Date(as.numeric(substr(tstmps,5,7)) - 1, origin = paste0(substr(tstmps,1,4),"-01-01"))
head(dates,10)
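# Last step is to group the files by month
# (months() returns locale-dependent names; month.name is English)
print(month.name)
# simplify = FALSE guarantees a named list, so that $January etc. work below
myras_byM = sapply(month.name, function(x) myras[months(dates) == x], simplify = FALSE)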
head(myras_byM$January)
head(myras_byM$February)
head(myras_byM$March)
head(myras_byM$April)
head(myras_byM$May)
head(myras_byM$June)
head(myras_byM$July)
head(myras_byM$August)
head(myras_byM$September)
head(myras_byM$October)
head(myras_byM$November)
head(myras_byM$December)
You can easily get the date from your filename, if you have a consistent naming convention.
In your case, I see the files are ordered by year and day of the year (DOY). So just strip the date from the filename, and then you can filter it by whatever you need. To do this I'm using regular expressions. In this case, I'm interested in the year and DOY string, which should always be 7 digits. The corresponding RE is therefore [[:digit:]]{7}, which means 7 consecutive digits. regexpr finds the matches and regmatches returns them.
dts <- regmatches(myras,regexpr('[[:digit:]]{7}',myras))
Then you just use substring to extract the digits you need (this method assumes it's always 4 digits for year followed by 3 for DOY) and convert it to a date:
dts <-as.Date(as.numeric(substr(dts,5,7)) - 1, origin = paste0(substr(dts,1,4),"-01-01"))
That would give you the variable of filter you have in your example.
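If you then want to group the entire time series by month, you could use sapply or lapply with the built-in names month.name. The base function months returns the name of the month for a given date (in the current locale, so this comparison assumes English month names). Passing simplify = FALSE makes sapply always return a named list:
myras_byMonth <- sapply(month.name,function(x) myras[months(dts) == x], simplify = FALSE)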
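As a variation on the steps above, the year-plus-DOY stamp can also be parsed directly with the "%Y%j" date format, and the files grouped with split() on a numeric month key, which does not depend on the locale. This is only a sketch: the three file names are made up to follow the naming scheme in the question, and by_year_month and by_month are illustrative names.

```r
myras <- c("D:/SNOWL/MOYDSL10A1.A2002001.tif",
           "D:/SNOWL/MOYDSL10A1.A2002032.tif",
           "D:/SNOWL/MOYDSL10A1.A2003045.tif")

# Extract the 7-digit year + day-of-year stamp from each file name
stamps <- regmatches(myras, regexpr("[0-9]{7}", myras))

# %Y = 4-digit year, %j = day of the year (001-366)
dates <- as.Date(stamps, format = "%Y%j")
print(dates)
# [1] "2002-01-01" "2002-02-01" "2003-02-14"

# Group by year and month, or by month across all years
by_year_month <- split(myras, format(dates, "%Y-%m"))
by_month      <- split(myras, format(dates, "%m"))
print(names(by_year_month))
# [1] "2002-01" "2002-02" "2003-02"
```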
Hope I understood your question correctly and this was what you were looking for.
Best,
Val

R: Regress all variables that match certain pattern

Is there a way in R to add all variables into a regression that match a certain pattern? For example, I have a bunch of variables in my dataset that correspond to holidays with the prefix h_ and I have other variables with other prefixes such as a_
Is there a way to do something like this:
lm(homicide ~ h_* + a_*, data= df)
To programmatically construct a formula, have a look at reformulate().
Here's an example that uses grep() to find all variables that begin with a "d" and then uses reformulate() to plug them in as the regressor variables on the RHS of a formula object.
vv <- grep("^d.*", names(mtcars), value=TRUE)
ff <- reformulate(termlabels=vv, response="mpg")
lm(ff, data=mtcars)
#
# Call:
# lm(formula = ff, data = mtcars)
#
# Coefficients:
# (Intercept) disp drat
# 21.84488 -0.03569 1.80203
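Applied to the prefixes from the question, the same idea might look like the sketch below. The data frame and its column names (h_xmas, h_easter, a_temp, other) are simulated purely for illustration; only the h_ and a_ prefixes and the homicide response come from the question.

```r
set.seed(1)
# Simulated stand-in for the asker's data
df <- data.frame(homicide = rnorm(20),
                 h_xmas   = rnorm(20),
                 h_easter = rnorm(20),
                 a_temp   = rnorm(20),
                 other    = rnorm(20))

# Keep every column whose name starts with "h_" or "a_"
vars <- grep("^(h|a)_", names(df), value = TRUE)

# Build homicide ~ h_xmas + h_easter + a_temp and fit the model
ff  <- reformulate(termlabels = vars, response = "homicide")
fit <- lm(ff, data = df)
print(ff)
```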
A string can be turned into a formula.
data(iris)
fmla <- as.formula(paste("Species ~",
paste(grep("Width", names(iris), value = TRUE), collapse = " + ")))
glm(fmla, data = iris, family = binomial(link = "logit"))