Date-specific manipulation in a list in R

I have a list with two elements:
list1 <- list()
list1$date <- c("01/06/2002", "02/06/2002", "03/06/2002",
"04/06/2002", "05/06/2002", "01/07/2002", "19/07/2002", "11/07/2002",
"15/07/2002", "29/07/2002", "03/07/2002")
list1$value1 <- c(100,200,300,100,200,300,100,200,300,100,200)
I am trying to scale "value1" where it is at its maximum during the first week of the month, and also during the last 2 days of the month. That is:
If the date falls between the 1st and the 7th, only the maximum value in that range must be doubled
If the day of the month is >= 28, the value must also be doubled
Is there a way to do this?

The lubridate package provides a variety of convenient date functions
library(lubridate)
list1 <- list()
list1$date <- c("01/06/2002", "02/06/2002", "03/06/2002",
"04/06/2002", "05/06/2002", "01/07/2002", "19/07/2002", "11/07/2002",
"15/07/2002", "29/07/2002", "03/07/2002")
list1$value1 <- c(100,200,300,100,200,300,100,200,300,100,200)
The elements of list1$date are strings, so use lubridate's dmy() (for day-month-year) to convert them to the Date class, and then use the day() function to extract the numeric day of the month.
Assign to the doubles variable the dates that are in the first week (i.e., day 7 or earlier) and hold the first-week maximum, or that fall on or after day 28.
first7 <- day(dmy(list1$date)) <= 7
after28 <- day(dmy(list1$date)) >= 28
doubles <- (first7 & list1$value1 == max(list1$value1[first7], na.rm = TRUE)) | after28
Build a coefficients vector: 2 for the values that meet the doubles criteria and 1 for those that do not.
coefficients <- ifelse(doubles, 2, 1)
Multiply list1$value1 by the coefficients to get the required result
list1$value1 * coefficients
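Note that max() above is taken over the first-week values of all months pooled together. If the maximum is meant to be found separately within each month, something like the following might work. This is only a sketch: it uses base R instead of lubridate, and the names dom, ym, week_max and scaled are illustrative rather than taken from the question.

```r
list1 <- list()
list1$date <- c("01/06/2002", "02/06/2002", "03/06/2002",
                "04/06/2002", "05/06/2002", "01/07/2002", "19/07/2002", "11/07/2002",
                "15/07/2002", "29/07/2002", "03/07/2002")
list1$value1 <- c(100, 200, 300, 100, 200, 300, 100, 200, 300, 100, 200)

dates <- as.Date(list1$date, format = "%d/%m/%Y")
dom   <- as.integer(format(dates, "%d"))  # day of the month
ym    <- format(dates, "%Y-%m")           # year-month key used for grouping

first7  <- dom <= 7
after28 <- dom >= 28

# First-week maximum computed within each month; values outside the
# first week are masked with NA so they cannot become the maximum.
week_max <- ave(ifelse(first7, list1$value1, NA_real_), ym,
                FUN = function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE))

doubles <- (first7 & list1$value1 == week_max) | after28
scaled  <- list1$value1 * ifelse(doubles, 2, 1)
```

For the sample data this selects the same elements as the pooled version, because both months happen to share the same first-week maximum.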

Related

Save results for each file of a list of files looping through a factor variable in R. Vector does not update

I am using a list of files, and I am trying to create a data frame that contains: for each sample, the percentage of two particular "GT" types by the levels of another factor variable called "chr" (with 1 to 24 levels).
It would have to look like this:
The problem I keep getting is that the vector never gets updated for the ith sample, it only keeps the first vector created. And then I am not sure how to save that updated vector on my data frame (df).
vector_chr <- vector();
for (i in seq_along(list_files)) {
GT <- list_files[[i]][,9]
chr <- list_files[[i]][,3]
GT$chr <- chr$chr # creating one df with both GT and chr
for (j in unique(GT$chr)){
dat_list = split(GT, GT$chr) # split data frames by chr (1 to 24)
table <- table(dat_list[[j]][,1]) # take GT and make a table
sum <- sum(table[3:4]) # sum GTs 3 and 4
perc <- sum/nrow(GT)
vector_chr <- c(vector_chr,perc) # assign the 24 percentages to a vector
}
df <- data.frame(matrix(ncol = 25, nrow = length(files)))
x <- c("Sample", "chr1", "chr2", "chr3",
"chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10",
"chr11", "chr12","chr13", "chr14", "chr15", "chr16",
"chr17", "chr18", "chr19", "chr20", "chr21", "chr22",
"chrX", "chrXY")
colnames(df) <- x
df$Sample <- names(list_files)
df[i,2:25] <- vector_chr # assign the 24 percentages for EACH sample
}
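Judging only from the code above, two things look suspicious: vector_chr is created once before the outer loop and never reset, so it keeps growing across samples, and df is rebuilt inside the loop, so earlier rows are discarded. A minimal sketch that avoids both is shown below. The toy list_files, the column names chr and GT, the GT labels, and target_gt are all hypothetical stand-ins, not the asker's data.

```r
# Toy stand-in for list_files: two samples, each with a chr and a GT column.
set.seed(1)
make_sample <- function(n) data.frame(
  chr = sample(c("1", "2", "X"), n, replace = TRUE),
  GT  = sample(c("AA", "AB", "BB", "NC"), n, replace = TRUE)
)
list_files <- list(s1 = make_sample(50), s2 = make_sample(60))

target_gt  <- c("BB", "NC")      # the two GT types of interest (assumed labels)
chr_levels <- c("1", "2", "X")   # chromosome levels, in output column order

# One vector of per-chromosome fractions per sample. Each sample is handled
# inside its own function call, so nothing carries over between samples.
perc <- t(sapply(list_files, function(d) {
  hits <- d$GT %in% target_gt
  # count of target GTs on each chromosome, divided by the sample's row count
  # (the same denominator as nrow(GT) in the question)
  tapply(hits, factor(d$chr, levels = chr_levels), sum) / nrow(d)
}))

# The result data frame is built once, after all samples are processed.
df <- data.frame(Sample = names(list_files), perc, check.names = FALSE)
```

If the percentage should instead be relative to the number of rows on each chromosome, tapply(hits, ..., mean) would replace the sum divided by nrow(d).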

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately, the cause of the filtering issue was that I was referring to a scalar in quotes to define the bounds. I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily directly related: 1) how to filter data with a randomly generated integer value, and 2) the k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out in Stata, using some tools that transfer easily to other problems (such as matrix generation and manipulation to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because both fit a separate model on the testing data and on the training data. Strictly speaking, the model should be fitted on the training data only, and its performance should then be assessed on the testing data using those fitted parameters.
For further reference on the procedure, the scikit-learn documentation explains it well, with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
// run the model using fold == `b'
qui reg y x1 x2 if randint ==`b'
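// save R-squared for the model fitted on fold `b'
matrix res[`b', 1] = e(r2)
// save RMSE for fold `b' (e(rmse) is the root MSE, although the column is labelled MSE_fold)
matrix res[`b', 2] = e(rmse)
// run the model using fold != `b'
qui reg y x1 x2 if randint !=`b'
// save R-squared for the model refitted on the remaining folds
matrix res[`b', 3] = e(r2)
// save RMSE for the remaining folds (a refit, not an out-of-sample test)
matrix res[`b', 4] = e(rmse)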
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/
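The caveat in this answer, that the model should be fitted on the training folds and then scored on the held-out fold, can be sketched as follows. The sketch is in R rather than Stata, it reuses the simulated model from the answer, and the names dat, fold, test_mse and cv_mse are illustrative. It also covers the bonus question: the per-fold test MSEs are collected in a vector and averaged at the end.

```r
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame(x1, x2, y = 1 + 0.5 * x1 + 1.5 * x2 + rnorm(n))

k <- 5
# Balanced random fold labels: each of 1..k appears n / k times.
dat$fold <- sample(rep(1:k, length.out = n))

test_mse <- numeric(k)
for (b in 1:k) {
  train <- dat[dat$fold != b, ]
  test  <- dat[dat$fold == b, ]
  fit   <- lm(y ~ x1 + x2, data = train)   # fit on the training folds only
  pred  <- predict(fit, newdata = test)    # predict the held-out fold
  test_mse[b] <- mean((test$y - pred)^2)   # out-of-sample MSE for fold b
}
cv_mse <- mean(test_mse)                   # average test MSE across folds
```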

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities in R through the Rcpp package. The actual dataframe contains over 2 million rows, and processing it is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z, which is a categorical variable with levels a, b and c. I had done a merge operation from another dataframe to bring the columns a, b and c, with their specific numbers, into df.
The first step would be to match each row's value of z with the column names (a, b or c), create a new column called 'type', and copy the corresponding number into it.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Next, df$type must be matched with the column names of another dataframe called 'cost' to create df$cost. The cost dataframe has numbers as column names, e.g. "103", "203" etc.
For our example, df$cost[1] = 6: df$type[1] = 303 is matched with cost$`303`[1] = 6.
The final dataframe should look like this (a sample output I created):
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost, melt(df, id.vars = "z")) # combine cost with df in long format (one row per z/column pair)
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works
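As an alternative without reshaping, the two lookups can be sketched with matrix indexing and a named vector in base R. This is only a sketch: vals, cost_lookup and type are illustrative names, and cost is stored as a named vector here rather than as the one-row dataframe from the question. Both steps are vectorised, so they may scale reasonably to a few million rows before Rcpp is needed.

```r
df <- data.frame(z = c("a", "b", "c"),
                 a = c(303, 403, 503),
                 b = c(203, 103, 803),
                 c = c(903, 803, 703))
# Cost table as a named numeric vector (names are the type codes).
cost_lookup <- c("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8,
                 "603" = 9, "703" = 10, "803" = 11, "903" = 12)

# Step 1: for row i, pick the value in the column named by z[i].
vals <- as.matrix(df[, c("a", "b", "c")])   # numeric matrix, z left out
type <- vals[cbind(seq_len(nrow(df)), match(df$z, colnames(vals)))]

# Step 2: look up the cost by the type code, used as a name.
df1 <- data.frame(z = df$z,
                  type = type,
                  cost = unname(cost_lookup[as.character(type)]))
```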

Aggregate modis list files by month

I am looking for a more efficient way of separating each year of the time series (2002-2016) by month. I've done it by hand, but it takes a lot of time.
mypath<-"D:/SNOWL"
myras<-list.files(path=mypath,pattern = glob2rx("*.tif$"),
full.names = TRUE, recursive = TRUE)
> myras
[1] "D:/SNOWL/MOYDSL10A1.A2002001.tif" "D:/SNOWL/MOYDSL10A1.A2002002.tif"
[3] "D:/SNOWL/MOYDSL10A1.A2002003.tif" "D:/SNOWL/MOYDSL10A1.A2002004.tif"
[5] "D:/SNOWL/MOYDSL10A1.A2002005.tif" "D:/SNOWL/MOYDSL10A1.A2002006.tif"
[7] "D:/SNOWL/MOYDSL10A1.A2002007.tif" "D:/SNOWL/MOYDSL10A1.A2002008.tif"
[9] "D:/SNOWL/MOYDSL10A1.A2002009.tif" "D:/SNOWL/MOYDSL10A1.A2002010.tif"
[11] "D:/SNOWL/MOYDSL10A1.A2002011.tif" "D:/SNOWL/MOYDSL10A1.A2002012.tif"
serie<-orgTime(myras, nDays = "asIn", begin ="2002-01-01",end = "2016-12-31", pillow = 75, pos1 = 13, pos2 = 19)
filter<-serie$inputLayerDates
> filter
[1] "2002-01-01" "2002-01-02" "2002-01-03" "2002-01-04" "2002-01-05"
[6] "2002-01-06" "2002-01-07" "2002-01-08" "2002-01-09" "2002-01-10"
[11] "2002-01-11" "2002-01-12" "2002-01-13" "2002-01-14" "2002-01-15"
[16] "2002-01-16" "2002-01-17" "2002-01-18" "2002-01-19" "2002-01-20"
[21] "2002-01-21" "2002-01-22" "2002-01-23" "2002-01-24" "2002-01-25"
[26] "2002-01-26" "2002-01-27" "2002-01-28" "2002-01-29" "2002-01-30"
[31] "2002-01-31" "2002-02-01" "2002-02-02" "2002-02-03" "2002-02-04"
[36] "2002-02-05" "2002-02-07" "2002-02-08" "2002-02-09" "2002-02-10"
[41] "2002-02-11" "2002-02-12" "2002-02-13" "2002-02-14" "2002-02-15"
EDIT:
Ok, let's try a full size example and see if it's working for you:
# Here we generate filenames as returned from `list.files`:
rm(list = ls())
myras <- sapply(1:5465, function(i) paste0('D:/SNOWL/MOYDSL10A1.A',sample(2000:2016,1),sample(c(paste0('00',1:9),paste0('0',10:99),100:365),1),'.tif'))
head(myras)
# Let's extract the timestamps
tstmps <- regmatches(myras,regexpr('[[:digit:]]{7}',myras))
head(tstmps,50)
# And now convert the timestamps to dates
dates <- as.Date(as.numeric(substr(tstmps,5,7)) - 1, origin = paste0(substr(tstmps,1,4),"-01-01"))
head(dates,10)
# Last step is to sort the files by month
#check months
print(month.name)
myras_byM = sapply(month.name,function(x) myras[months(dates) == x])
head(myras_byM$January)
head(myras_byM$February)
head(myras_byM$March)
head(myras_byM$April)
head(myras_byM$May)
head(myras_byM$June)
head(myras_byM$July)
head(myras_byM$August)
head(myras_byM$September)
head(myras_byM$October)
head(myras_byM$November)
head(myras_byM$December)
You can easily get the date from your filename if you have a consistent naming convention.
In your case, I see the files are named by year and day of the year (DOY). So just strip the date from the filename, and then you can filter it by whatever you need. To do this I'm using regular expressions. In this case, I'm interested in the year and DOY string, which should always be 7 digits. The corresponding RE is therefore [[:digit:]]{7}, which means 7 consecutive digits. regexpr finds the matches and regmatches returns them.
dts <- regmatches(myras,regexpr('[[:digit:]]{7}',myras))
Then you just use substring to extract the digits you need (this method assumes it's always 4 digits for year followed by 3 for DOY) and convert it to a date:
dts <-as.Date(as.numeric(substr(dts,5,7)) - 1, origin = paste0(substr(dts,1,4),"-01-01"))
That gives you the same dates as the filter variable in your example.
If you then want to group the entire time series by month, you could use sapply or lapply with the built-in constant month.name. The base function months returns the name of the month for a given date (in the current locale, so the comparison with month.name assumes English month names):
myras_byMonth <- sapply(month.name,function(x) myras[months(dts) == x])
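If the files should be separated by year and month rather than by calendar month pooled over all years, a variant along these lines might help. It is a sketch only: the five file names are made up to follow the naming pattern in the question, and by_ym is an illustrative name. Using format() with numeric codes also avoids any dependence on the locale's month names.

```r
# Made-up file names following the question's pattern (year 2002, selected DOYs).
myras <- sprintf("D:/SNOWL/MOYDSL10A1.A2002%03d.tif", c(1, 31, 32, 60, 365))

# Extract the 7-digit year + DOY stamp and convert it to a Date.
tstmps <- regmatches(myras, regexpr("[[:digit:]]{7}", myras))
dates  <- as.Date(as.numeric(substr(tstmps, 5, 7)) - 1,
                  origin = paste0(substr(tstmps, 1, 4), "-01-01"))

# Group the file names by a "YYYY-MM" key: one list element per year-month.
by_ym <- split(myras, format(dates, "%Y-%m"))
```

Here by_ym[["2002-01"]] holds the two January files, and names(by_ym) lists the year-months that actually occur.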
Hope I understood your question correctly and this was what you were looking for.
Best,
Val

Use carryforward with dynamic condition to limit carry forward time interval

I am using carryforward (ssc install carryforward) to fill in missing observations. Some of my data are annual and I want to use them for subsequent monthly observations, but only if the carried forward data are less than two years old. Can I achieve this logic with the dynamic_condition() option, particularly using #? I have to complete this for many variables, and would like to avoid a lot of variable generation and dropping (and really I'd like to know if it's possible).
The following "manual" solution works, but can I replicate it on the fly with dynamic_condition()? My attempts below fail.
/* generate data with observation every June */
clear
set obs 100
generate date_ym = ym(2001, 1) + (_n - 1)
format date_ym %tm
generate date_m = month(dofm(date_ym))
generate x = runiform() if (date_m == 6) & !inlist(_n, 30, 42)
/* carryforward (ssc install carryforward), "manual" solution */
egen date_m2 = group(date_ym) if !missing(x)
carryforward date_m2, replace
bysort date_m2 (date_ym): generate date_m3 = cond(_n > 24, ., date_m2)
carryforward x if !missing(date_m3), gen(x_cf)
tsset date_ym
list, sep(12)
/* can I replicate this with dynamic_condition() option? */
/* no time series operators with # */
/* carryforward x, gen(x_cf2) dynamic_condition(sum(d.# == 0) < 24) */
/* x_cf2: d.x_cf2 invalid name */
/* second # doesn't work */
/* carryforward x, gen(x_cf3) dynamic_condition(sum(# == #[_n - 1]) < 24) */
/* x_cf3: equation [_n-1] not found */
Disclosure: I don't use carryforward (SSC), but that's because I tend to think back to the principles as I understand them, as documented here.
To do this, you need to keep a record not only of previous non-missing values but also of the dates when a variable was last not missing. This arose previously: see this answer
The essence of a simpler approach is here:
clear
set seed 2803
set obs 100
generate date_ym = ym(2001, 1) + (_n - 1)
format date_ym %tm
generate x = runiform() if inlist(_n, 30, 42)
gen last = date_ym if !missing(x)
replace last = last[_n-1] if missing(last)
replace x = x[_n-1] if missing(x) & (date_ym - last) < 24
The generalisation to panels uses by:, and the generalisation to multiple variables uses a foreach loop. If the dates of missing values can be different for different variables, that mostly just shifts the loop.
Schematically, suppose we are cycling over an arbitrary varlist and that the dates of missing values differ, but we apply the rule of using the last value within 24 months.
gen last = .
quietly foreach v of varlist <varlist> {
replace last = cond(!missing(`v'), date_ym, .)
replace last = last[_n-1] if missing(last)
replace `v' = `v'[_n-1] if missing(`v') & (date_ym - last) < 24
}
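For readers who want to check the logic outside Stata, the single-variable rule (carry the last non-missing value forward only while it is less than 24 months old) can be sketched in R. The values 0.3 and 0.7 and the names idx and x_cf are illustrative, and the data are assumed to be one monthly observation per row, sorted by date.

```r
n <- 100
x <- rep(NA_real_, n)
x[c(30, 42)] <- c(0.3, 0.7)   # two observed values, as in the Stata example

# Row index of the most recent non-missing observation (0 if none so far).
idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))

# Age of that observation in months, since rows are consecutive months.
age <- seq_along(x) - idx

# Carry forward only when a previous value exists and is under 24 months old.
x_cf <- ifelse(idx > 0 & age < 24, x[pmax(idx, 1L)], NA_real_)
```

With these inputs, rows 30 to 41 take the first value, rows 42 to 65 take the second, and everything else stays missing.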