Aggregate modis list files by month - r-raster

I am looking for a more efficient way to group the files of each year in the time series (2002-2016) by month. I've done it by hand, but it takes a long time.
mypath<-"D:/SNOWL"
myras<-list.files(path=mypath,pattern = glob2rx("*.tif$"),
full.names = TRUE, recursive = TRUE)
> myras
[1] "D:/SNOWL/MOYDSL10A1.A2002001.tif" "D:/SNOWL/MOYDSL10A1.A2002002.tif"
[3] "D:/SNOWL/MOYDSL10A1.A2002003.tif" "D:/SNOWL/MOYDSL10A1.A2002004.tif"
[5] "D:/SNOWL/MOYDSL10A1.A2002005.tif" "D:/SNOWL/MOYDSL10A1.A2002006.tif"
[7] "D:/SNOWL/MOYDSL10A1.A2002007.tif" "D:/SNOWL/MOYDSL10A1.A2002008.tif"
[9] "D:/SNOWL/MOYDSL10A1.A2002009.tif" "D:/SNOWL/MOYDSL10A1.A2002010.tif"
[11] "D:/SNOWL/MOYDSL10A1.A2002011.tif" "D:/SNOWL/MOYDSL10A1.A2002012.tif"
serie<-orgTime(myras, nDays = "asIn", begin ="2002-01-01",end = "2016-12-31", pillow = 75, pos1 = 13, pos2 = 19)
filter<-serie$inputLayerDates
> filter
[1] "2002-01-01" "2002-01-02" "2002-01-03" "2002-01-04" "2002-01-05"
[6] "2002-01-06" "2002-01-07" "2002-01-08" "2002-01-09" "2002-01-10"
[11] "2002-01-11" "2002-01-12" "2002-01-13" "2002-01-14" "2002-01-15"
[16] "2002-01-16" "2002-01-17" "2002-01-18" "2002-01-19" "2002-01-20"
[21] "2002-01-21" "2002-01-22" "2002-01-23" "2002-01-24" "2002-01-25"
[26] "2002-01-26" "2002-01-27" "2002-01-28" "2002-01-29" "2002-01-30"
[31] "2002-01-31" "2002-02-01" "2002-02-02" "2002-02-03" "2002-02-04"
[36] "2002-02-05" "2002-02-07" "2002-02-08" "2002-02-09" "2002-02-10"
[41] "2002-02-11" "2002-02-12" "2002-02-13" "2002-02-14" "2002-02-15"

EDIT:
Ok, let's try a full size example and see if it's working for you:
# Here we generate filenames as returned from `list.files`:
rm(list = ls())
myras <- sapply(1:5465, function(i) paste0('D:/SNOWL/MOYDSL10A1.A',sample(2000:2016,1),sample(c(paste0('00',1:9),paste0('0',10:99),100:365),1),'.tif'))
head(myras)
# Let's extract the timestamps
tstmps <- regmatches(myras,regexpr('[[:digit:]]{7}',myras))
head(tstmps,50)
# And now convert the timestamps to dates
dates <- as.Date(as.numeric(substr(tstmps,5,7)) - 1, origin = paste0(substr(tstmps,1,4),"-01-01"))
head(dates,10)
# Last step is to sort the files by month
#check months
print(month.name)
myras_byM <- sapply(month.name, function(x) myras[months(dates) == x], simplify = FALSE) # simplify = FALSE always returns a named list
head(myras_byM$January)
head(myras_byM$February)
head(myras_byM$March)
head(myras_byM$April)
head(myras_byM$May)
head(myras_byM$June)
head(myras_byM$July)
head(myras_byM$August)
head(myras_byM$September)
head(myras_byM$October)
head(myras_byM$November)
head(myras_byM$December)
You can easily get the date from your filename, if you have a consistent naming convention.
In your case, I see the files are named by year and day of the year (DOY). So just strip the date from the filename, and then you can filter by whatever you need. To do this I'm using regular expressions. Here, I'm interested in the year and DOY string, which should always be 7 digits. The corresponding RE is therefore [[:digit:]]{7}, which means 7 consecutive digits. regexpr finds the matches and regmatches returns them.
dts <- regmatches(myras,regexpr('[[:digit:]]{7}',myras))
Then you just use substring to extract the digits you need (this method assumes it's always 4 digits for year followed by 3 for DOY) and convert it to a date:
dts <-as.Date(as.numeric(substr(dts,5,7)) - 1, origin = paste0(substr(dts,1,4),"-01-01"))
That gives you the same dates as the filter variable in your example.
If you then want to sort the entire time series by month, you could use sapply or lapply with the built-in names month.name. The base function months will return you the name of the month for a given date:
myras_byMonth <- sapply(month.name, function(x) myras[months(dts) == x], simplify = FALSE)
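Note that months() returns month names in the current locale, so comparing against the English month.name can come back empty on a non-English system. A minimal sketch of a locale-independent variant, assuming the same AYYYYDDD naming; the four file names and the variables by_month and by_year_month are made up for illustration:

```r
# Hypothetical file names following the MOYDSL10A1.AYYYYDDD.tif convention
files <- c("D:/SNOWL/MOYDSL10A1.A2002001.tif",
           "D:/SNOWL/MOYDSL10A1.A2002032.tif",
           "D:/SNOWL/MOYDSL10A1.A2003015.tif",
           "D:/SNOWL/MOYDSL10A1.A2003060.tif")

# Pull the 7-digit year + day-of-year stamp out of each base name
base  <- basename(files)
stamp <- regmatches(base, regexpr("[[:digit:]]{7}", base))

# %Y = 4-digit year, %j = day of year (001-366)
dates <- as.Date(stamp, format = "%Y%j")

# Group by numeric month ("01".."12"), pooling all years
by_month <- split(files, format(dates, "%m"))

# Or keep the years apart: one group per year-month
by_year_month <- split(files, format(dates, "%Y-%m"))
```

split() always returns a named list, so by_month[["01"]] holds every January file and by_year_month[["2003-03"]] only those from March 2003.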
Hope I understood your question correctly and this was what you were looking for.
Best,
Val

Related

date specific manipulation in a list in R

I have a list in lists
list1 <- list()
list1$date <- c("01/06/2002", "02/06/2002", "03/06/2002",
"04/06/2002", "05/06/2002", "01/07/2002", "19/07/2002", "11/07/2002",
"15/07/2002", "29/07/2002", "03/07/2002")
list1$value1 <- c(100,200,300,100,200,300,100,200,300,100,200)
I am trying to scale "value1" for the maximum during the first week and for the last days of the month. That is:
if the date falls between the 1st and the 7th, only the maximum value must be doubled
if the date is >= 28, the value also needs to be doubled
Is there a way to do this?
The lubridate package provides a variety of convenient date functions
library(lubridate)
list1 <- list()
list1$date <- c("01/06/2002", "02/06/2002", "03/06/2002",
"04/06/2002", "05/06/2002", "01/07/2002", "19/07/2002", "11/07/2002",
"15/07/2002", "29/07/2002", "03/07/2002")
list1$value1 <- c(100,200,300,100,200,300,100,200,300,100,200)
The elements of list1$date are strings, so use lubridate's dmy() (for day-month-year) to convert them to the Date class, and then use the day() function to extract the numeric day of the month.
Assign to the doubles variable the dates that are in the first week (i.e., day 7 or earlier) and hold the first-week maximum, or that fall on or after day 28.
first7 <- day(dmy(list1$date)) <= 7
after28 <- day(dmy(list1$date)) >= 28
doubles <- (first7 & list1$value1 == max(list1$value1[first7],na.rm=T)) | after28
Assign to a coefficients variable a multiplier of 2 for the values that meet the doubles criteria and 1 for those that do not.
coefficients <- ifelse(doubles,2,1)
Multiply list1$value1 by the coefficients to get the required result
list1$value1 * coefficients
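The code above takes one maximum over all first-week dates. If the maximum is meant per month (an assumption, the question does not say), a base R sketch could look like the following; the variable names are illustrative only:

```r
dates <- as.Date(c("01/06/2002", "02/06/2002", "03/06/2002", "04/06/2002",
                   "05/06/2002", "01/07/2002", "19/07/2002", "11/07/2002",
                   "15/07/2002", "29/07/2002", "03/07/2002"),
                 format = "%d/%m/%Y")
value1 <- c(100, 200, 300, 100, 200, 300, 100, 200, 300, 100, 200)

day_of_month <- as.integer(format(dates, "%d"))
month_id     <- format(dates, "%Y-%m")   # grouping key, one per calendar month
first_week   <- day_of_month <= 7
month_end    <- day_of_month >= 28

# Largest first-week value within each month; other days are masked with -Inf
first_week_max <- ave(ifelse(first_week, value1, -Inf), month_id, FUN = max)

# Double the per-month first-week maximum and everything on or after day 28
doubles <- (first_week & value1 == first_week_max) | month_end
scaled  <- value1 * ifelse(doubles, 2, 1)
```

For this sample data both readings give the same result, because the first-week maximum is 300 in June and in July.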

How to include variables values into regular expressions in R

I have 5 files which contain metabolites (details of different bacteria models). I'm writing a function to append a specified number of files. File names look like the following.
[1] "01_iAPECO1_1312_metabolites.csv" "02_iB21_1397_metabolites.csv"
[3] "03_iBWG_1329_metabolites.csv" "04_ic_1306_metabolites.csv"
[5] "05_iE2348C_1286_metabolites.csv"
Below is my function.
start = 3 # defines the starting position of the range
end = 5 # defines the ending position of the range
type = "metabolites" # two types of files - for metabolites and reactions
files <- NULL
if (type == "metabolites"){
files <- list.files(pattern = "*metabolites\\.csv$")
}else if(type == "reactions"){
files <- list.files(pattern = "*reactions\\.csv$")
}
#reading each file within the range and append them to create one file
for (i in start:end){
temp_df <- data.frame(ModelName = character(), Object = character(),stringsAsFactors = F)
#reading the current file
temp = rbind(one,temp_df)
}
#writing the appended file
write.csv(temp,"appended.csv",row.names = F,quote = F)
temp_df <- NULL
For example, if I specify the start=3 and end = 5, the code is supposed to read files 03, 04 and 05 and append them. Note: the two integers at the beginning of the file names are used to get the file referenced by the range. I'm unable to select the required file within the for loop using a regular expression. When I specify the number it picks up but I'm looking for a generalized version with i in it.
currentFile = grep("01.+",files)
Any help is appreciated.
For the test data shown below this returns a vector containing the file names of the files that start with 02, 03, 04 and 05 and end with "reactions.csv"
# create some test files
for(i in 1:5) cat(file = sprintf("%02djunkreactions.csv", i))
# test input
start <- 2
end <- 5
type <- "reactions"
list.files(pattern = paste(sprintf("^%02d.*%s[.]csv$", start:end, type), collapse = "|"))
giving:
[1] "02junkreactions.csv" "03junkreactions.csv" "04junkreactions.csv"
[4] "05junkreactions.csv"
Note: If start and end are both always one digit then a simplification is possible:
list.files(pattern = sprintf("^0[%d-%d].*%s[.]csv$", start, end, type))
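The same pattern can be checked without touching the file system by filtering a plain character vector. A small sketch using the file names from the question; selected is a name introduced here:

```r
files <- c("01_iAPECO1_1312_metabolites.csv", "02_iB21_1397_metabolites.csv",
           "03_iBWG_1329_metabolites.csv", "04_ic_1306_metabolites.csv",
           "05_iE2348C_1286_metabolites.csv")
start <- 3
end   <- 5
type  <- "metabolites"

# One alternative per number in the range: ^03_...|^04_...|^05_...
pattern <- paste(sprintf("^%02d_.*%s[.]csv$", start:end, type), collapse = "|")

selected <- files[grepl(pattern, files)]
```

Because sprintf() is vectorised over start:end, no loop over i is needed to build the expression.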
You can do this with a cross-join.
library(dplyr)
library(stringi)
start = 3
end = 5
type = "metabolites"
all_files = tibble(file = list.files() )
desired_files = tibble(
number = start:end,
regex = sprintf("^%02d.*%s", number, type) )
all_files %>%
merge(desired_files) %>%
filter(stri_detect_regex(file, regex)) %>%
group_by(number) %>%
do(read.csv(.$file) ) %>%
write.csv("appended.csv", row.names = F, quote = F)
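The read-and-append step can also be sketched in base R. The example below writes five tiny demo files into a temporary directory so that it is self-contained; the directory name, the file names and the column contents are invented for illustration:

```r
# Create five small demo files in a scratch directory (hypothetical content)
demo_dir <- file.path(tempdir(), "append_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:5) {
  write.csv(data.frame(ModelName = sprintf("model%02d", i), Object = c("a", "b")),
            file.path(demo_dir, sprintf("%02d_demo_metabolites.csv", i)),
            row.names = FALSE)
}

start <- 3
end   <- 5
type  <- "metabolites"

# list.files() matches the pattern against the base name of each file
pattern <- paste(sprintf("^%02d_.*%s[.]csv$", start:end, type), collapse = "|")
paths   <- list.files(demo_dir, pattern = pattern, full.names = TRUE)

# Read every selected file and stack the rows
appended <- do.call(rbind, lapply(paths, read.csv, stringsAsFactors = FALSE))
write.csv(appended, file.path(demo_dir, "appended.csv"), row.names = FALSE, quote = FALSE)
```

Growing a data frame with rbind inside a for loop copies it on every iteration; collecting the pieces in a list and binding once avoids that.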
Are you looking for something like this?
files <- c("01_iAPECO1_1312_metabolites.csv", "02_iB21_1397_metabolites.csv","03_iBWG_1329_metabolites.csv", "04_ic_1306_metabolites.csv","05_iE2348C_1286_metabolites.csv")
for(i in 2:4) print(grep(sprintf("^(%02d){1}_",i),files,value=T))

How to very efficiently extract specific pattern from characters?

I have big data like this :
> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
What I want to do is, in every row, select the name after the word mature= and also the word after Gene=, and then paste them together with
paste(a,b, sep="-")
for example, the expected output from first two rows would be like :
hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
so, the final implementation is like this:
for(i in 1:nrow(Data)){
Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
Data[i,4] <- as.numeric(sub("pvalue=","",Name))
print(i)
}
which works well, but it's very slow. Data is very big (200,000,000 rows), and this implementation is too slow for that. How can I speed it up?
If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the parentheses below) everything from the equals sign up to the pipe symbol, and from Gene= to the end, and paste them together with a hyphen:
sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
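Since sub() is vectorised, the whole for loop from the question can be replaced by two column-wise calls. A minimal sketch, assuming the second column looks like "something|pvalue=number"; that layout, the column names info and stats, and the sample values are guesses made for illustration:

```r
Data <- data.frame(
  info  = c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
            "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),
  stats = c("score=0.9|pvalue=0.01", "score=0.4|pvalue=0.2"),  # hypothetical layout
  stringsAsFactors = FALSE
)

# Whole column at once: keep the mature name and the gene, joined by "-"
Data$name <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data$info)

# Whole column at once: take the value after "pvalue=" in the second field
Data$pvalue <- as.numeric(sub("^[^|]*[|]pvalue=([^|]*).*$", "\\1", Data$stats))
```

Dropping the per-row print(i) and the row-by-row assignment into the data frame is likely where most of the time is saved.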
Another option is to use read.table with = as a separator and then paste the 2 columns:
res = read.table(text=as.character(Data[,1]),sep='=')
paste(sub('[|].*','',res$V2), ## drop everything from the first | onwards
sub('^ +| +$','',res$V4),sep='-') ## remove extra spaces
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5"
[5] "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5" "hsa-miR-650-OR4F5"
The simple sub solution already given looks quite nice but just in case here are some other approaches:
1) read.pattern Using read.pattern in the gsubfn package we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question:
library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")
giving:
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"
[4] "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
[7] "hsa-miR-650-OR4F5"
The intermediate data frame, DF, that was produced looks like this:
> DF
      V1               V2         V3      V4   V5    V6
1 mature     hsa-miR-5087 mir_Family       - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family  mir-26 Gene OR4F9
3 mature      hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature   hsa-miR-659-3p mir_Family       - Gene OR4F5
5 mature  hsa-miR-5197-3p mir_Family       - Gene OR4F5
6 mature     hsa-miR-5093 mir_Family       - Gene OR4F5
7 mature      hsa-miR-650 mir_Family mir-650 Gene OR4F5
Here is the regular expression we used:
(\w+)=([^|]*)
1a) names We could make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement:
DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))
paste(DF$mature, DF$Gene, sep = "-") # same answer as above
The DF produced in this section looks like this. It has 3 columns instead of 6; the remaining columns were used to determine appropriate column names:
> DF
            mature mir_Family  Gene
1     hsa-miR-5087          - OR4F5
2 hsa-miR-26a-1-3p     mir-26 OR4F9
3      hsa-miR-448    mir-448 OR4F5
4   hsa-miR-659-3p          - OR4F5
5  hsa-miR-5197-3p          - OR4F5
6     hsa-miR-5093          - OR4F5
7      hsa-miR-650    mir-650 OR4F5
2) strapplyc
Another approach using the same package. This extracts the fields coming after a = and not containing a | producing a list. We then sapply over that list pasting the first and third fields together:
sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))
giving the same result.
Here is the regular expression used:
=([^|]*)
Here is one approach:
Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
# [7] "hsa-miR-650-OR4F5"
Maybe not the most elegant, but you can try:
sapply(Data[,1],function(x){
parts<-strsplit(x,"\\|")[[1]]
y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
return(y)
})
Example
Data<-data.frame(col1=c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5","mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),col2=1:2,stringsAsFactors=F)
> Data[,1]
[1] "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5" "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
> sapply(Data[,1],function(x){
+ parts<-strsplit(x,"\\|")[[1]]
+ y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
+ return(y)
+ })
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5 mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
"hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9"

How to add column to data.table with values from list based on regex

I have the following data.table:
       id fShort
1  432-12   1245
2 3242-12 453543
3  324-32  45543
4  322-34  45343
5 2324-34  13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that includes the complete filename from the list. For this the values of column "id" need to be matched with the filename-list. If the filename starts with the "id" string, the complete filename should be returned. I use the following regex
t <- grep("432-12","432-124343.png",value=T)
that return the correct filename.
This is how the final table should look:
       id fShort          fComplete
1  432-12   1245     432-124343.png
2 3242-12 453543 3242-124342345.png
3  324-32  45543                 NA
4  322-34  45343                 NA
5 2324-34  13543                 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
m <- grep(x, filenames, value = TRUE)
if (!length(m)) NA else m})]
        id fShort          fComplete
1:  432-12   1245     432-124343.png
2: 3242-12 453543 3242-124342345.png
3:  324-32  45543                 NA
4:  322-34  45343                 NA
5: 2324-34  13543                 NA
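The question asks for file names that start with the id, while grep() matches the id anywhere in the name and treats it as a regular expression. A hedged base R sketch that anchors at the start with startsWith() and returns a plain character vector instead of a list column; first_match is a helper name made up here:

```r
ids       <- c("432-12", "3242-12", "324-32", "322-34", "2324-34")
filenames <- c("3242-124342345.png", "432-124343.png", "135-13434.jpeg")

# Hypothetical helper: first file name beginning with `id`, or NA if none does
first_match <- function(id, candidates) {
  hit <- candidates[startsWith(candidates, id)]
  if (length(hit)) hit[1] else NA_character_
}

fComplete <- vapply(ids, first_match, character(1),
                    candidates = filenames, USE.NAMES = FALSE)

# With data.table this would presumably become (not run here):
# DT[, fComplete := vapply(id, first_match, character(1), candidates = unlist(filenames))]
```

vapply() fails loudly if the helper ever returns more than one value, which is the situation behind the "argument 'pattern' has length > 1" warning in the question.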
In my experience with similar functions, the regex functions sometimes return a list, so you have to account for that in the apply call. I usually work through one example manually.
Also, in my experience apply on its own will not always return something that fits into a data.frame; sometimes I had to use lapply and/or unlist and data.frame to modify it.
Here is an answer. I am not familiar with data.tables and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by seeing what apply was outputting and adding the [1] to get the piece I needed.
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$fComplete <- filenames1[x]

Removing duplicates from the data

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those files into one:
all_data = do.call(rbind.fill, list_of_data)
In the new table is a column called "Accession". After combining many of the names (Accession) are repeated. And I would like to remove all of the duplicates.
Another problem is that some of those "names" are ALMOST the same. The only difference is a dot and a number after the name.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore the dot and the number after it.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both subset and rename the values:
subset(transform(all_data, Accession = sub("\\..*", "", Accession)),
!duplicated(Accession))
Accession
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
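If the original accessions should be kept, the cleaned key can go into its own column, as the question attempted. A small sketch on a shortened sample; deduped is a name introduced here. The character(0) in the question's error message suggests the right-hand side was empty, which would happen if all_data has no column called exactly Accession, so the sketch checks for that first:

```r
all_data <- data.frame(
  Accession = c("AT3G26450.1", "AT5G44520.2", "AT4G24770.1", "AT3G26450.2"),
  stringsAsFactors = FALSE
)

# Guard against a misspelled or missing column before deriving from it
stopifnot("Accession" %in% names(all_data))

# Strip the trailing ".<number>" so that both isoforms share one key
all_data$CleanedAccession <- sub("[.][0-9]+$", "", all_data$Accession)

# Keep the first row seen for each cleaned key
deduped <- all_data[!duplicated(all_data$CleanedAccession), ]
```

Here AT3G26450.2 is dropped because AT3G26450.1 already supplied the key AT3G26450.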
What about
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
".", fixed = T), "[", 1))), ]