My question's title may be a little bit ambiguous.
Previously, I wanted to "acquire complete list of subdirs" and then read the files in these subdirs into Stata (see this post and this post).
Thanks to @Roberto Ferrer's great suggestion, I almost managed to do this, but then I ran into another problem. Because I have so many separate files, the local macro holding the file list seems to hit its maximum length. After the command local n: word count, Stata sends an error message:
macro substitution results in line that is too long.
The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters, which is calculated on the basis of set maxvar. You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP.
The maximum line length is defined as 16 more than the maximum macro length, which is currently 645,200 characters. Each unit increase in set maxvar increases the length maximums by 129. The maximum value of set maxvar is 32,767. Thus, the maximum line length may be set up to 4,227,159 characters if you set maxvar to its largest value.
r(920);
When I reduce the number of subdirs to 5, Stata works fine. Since I have roughly 100 subdirs, I would have to repeat the procedure 20 times. That is manageable, but I would still like to know whether I can fully automate the process: more specifically, fill the macro up to its maximum allowable length, import those files, and then move on to the next group of subdirs.
Below you can find my code:
//====================================
//=== read and clean projects data ===
//====================================
version 14
set linesize 80
set more off
clear
macro drop _all
set linesize 200
cd G:\Data_backup\Soufang_data
*----------------------------------
* Read all files within directory
*----------------------------------
* Import the first worksheets 1:"项目首页" (project home page) 2:"项目概况" (project overview) 3:"成交详情" (transaction details)
* worksheet1
filelist, directory("G:\Data_backup\Soufang_data") pattern(*.xlsx)
* pattern(*.xlsx) prevents importing other file types (.doc or .dta)
gen tag = substr(reverse(dirname),1,6) == "esuoh/"
keep if tag==1
gen path = dirname+"\"+filename
qui valuesof path if tag==1
local filelist = r(values)
split dirname, parse("\" "/")
ren dirname4 citylist
drop dirname1-dirname3 dirname5
qui valuesof citylist if tag==1
local city = r(values)
local count = 1
local n:word count `filelist'
forval i = 1/`n' {
local file : word `i' of `filelist'
local cityname: word `i' of `city'
** don't add xlsx after `file', suffix has been added
** write "`file'" rather than `file', I don't know why but it works
qui import excel using "`file'",clear
cap qui sxpose,clear
cap qui drop in 1/1
gen city = "`cityname'"
if `count'==1 {
save house.dta,replace emptyok
}
else {
qui append using house
qui save house.dta,replace emptyok
}
local ++count
}
Thank you.
You do not need to store the whole list of files in a macro. filelist creates a dataset of the files that you want to work with. Just save it and reload it for each file you want to process. You also use a very inefficient way to append datasets. As the appended dataset grows, the cost of reloading and saving it becomes very high and can slow the whole process to a crawl.
Here's a sketch of how to process your Excel files
filelist, directory(".") pattern(*.xlsx)
save "myfiles.dta", replace
local n = _N
forval i = 1/`n' {
use in `i' using "myfiles.dta", clear
local f = dirname + "/" + filename
qui import excel using "`f'",clear
tempfile res`i'
save "`res`i''"
}
clear
forval i = 1/`n' {
append using "`res`i''"
}
save "final.dta", replace
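The question also records a city name for each file. The sketch above could carry that along roughly as follows. This is only an illustration under stated assumptions: filelist (SSC) is installed, Stata 14 or later is used (for strrpos()), the layout below the current directory is <city>/house/<file>.xlsx, and the variable names parent and city are made up for the example.

```stata
* Sketch only. Assumes: filelist (SSC) is installed, Stata 14 or later
* (for strrpos()), and a layout of <city>/house/<file>.xlsx below the
* current directory. Variable names parent and city are illustrative.
filelist, directory(".") pattern("*.xlsx")

* keep only files that sit directly in a "house" folder
keep if substr(reverse(dirname), 1, 6) == "esuoh/"

* city = name of the folder just above "house"
gen parent = substr(dirname, 1, length(dirname) - 6)
gen city   = substr(parent, strrpos(parent, "/") + 1, .)
save "myfiles.dta", replace

local n = _N
forvalues i = 1/`n' {
    * load one row of the file list and copy what we need into macros
    use in `i' using "myfiles.dta", clear
    local f = dirname + "/" + filename
    local c = city
    quietly import excel using "`f'", clear
    * any per-file cleaning from the question (sxpose, drop in 1) would go here
    gen city = "`c'"
    tempfile res`i'
    quietly save "`res`i''"
}

* append once at the end instead of re-saving a growing file
clear
forvalues i = 1/`n' {
    append using "`res`i''"
}
save "final.dta", replace
```

Because only one file name is held in a macro at a time, the macro length limit from the question should not come into play, however many files there are.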
I have problems combining multiple for loops. I will give an example with two loops that I would like to combine. If I know how to do it with two, I will also be able to do it with more.
If anyone knows how to write this with lapply, that would also be nice.
require(ncdf4)
#### download files from this link to directory: (I just downloaded manually,two files are sufficient to answer the example)
#### ftp://rfdata:forceDATA@ftp.iiasa.ac.at/WFDEI/LWdown_daily_WFDEI/
setwd("C:/place_where_I_have_downloaded_my_files_from_link/")
temp = list.files(pattern="*.nc") #list imported netcdf files
list2env(
lapply(setNames(temp, make.names(gsub("*.nc$", "", temp))),
nc_open), envir = .GlobalEnv) #import all parameters lists to global environment
#### first loop - # select parameter out of netcdf files and combine into a List of 2
list_temp<-list() #create empty list before loop
for (t in temp[1:2]){
list_temp[t]<-list(data.frame(LWdown=ncvar_get(nc_open(t),"LWdown")[428,176,],xcoor=176,ycoor=428))
}
LW_bind<-do.call(rbind,list_temp)
rownames(LW_bind)<-NULL
#### second loop # select parameter out of one netcdf file per x-coordinate and combine into a List of 2
list_temp<-list() #create empty list before loop
for (x in 176:177){
list_temp[as.character(x)]<-list(data.frame(LWdown=ncvar_get(nc_open(temp[1]),"LWdown")[428,x,],xcoor=x,ycoor=428)) #index by x, not by the leftover t from the first loop
}
LW_bind<-do.call(rbind,list_temp)
rownames(LW_bind)<-NULL
How I tried to combine but didn't work:
#### combined loops
list_temp<-list()
for (t in temp[1:2]){for (x in 176:177){
#ncin<-list()
ncin<-nc_open(t)
list_temp[x][t]<-list(data.frame(LWdown=ncvar_get(ncin,"LWdown")[428,x,],x=x,y=428))
}}
LWdown_1to2<-do.call(rbind,list_temp)
rownames(LWdown_1to2)<-NULL
I have already solved my problem; see below. But I am still curious how one could combine the two for loops as described above, so I will leave the question open and unanswered.
Here is my solution:
require(arrayhelpers);require(stringr);require(plyr);require(ncdf4)
# store all files from ftp://rfdata:forceDATA@ftp.iiasa.ac.at/WFDEI/ in the following folder:
setwd("C:/folder")
temp = list.files(pattern="*.nc") #list all the file names
param<-gsub("_\\S+","",temp,perl=T) #extract parameter from file name
xcoord=seq(176,180,by=1) #The X-coordinates you are interested in
ycoord=seq(428,433,by=1) #The Y-coordinates you are interested in
list_var<-list() # make an empty list
for (t in 1:length(temp)){
temp_year<-str_sub(temp[],-9,-6) #characters 9 to 6 counted from the end of each file name: the year
temp_month<-str_sub(temp[],-5,-4) #characters 5 to 4 counted from the end of each file name: the month
temp_netcdf<-nc_open(temp[t])
temp_day<-rep(seq_len(length(ncvar_get(temp_netcdf,"day"))),length(xcoord)*length(ycoord)) # day numbers, repeated once per grid cell so the vector is as long as the number of values
dim.order<-sapply(temp_netcdf[["var"]][[param[t]]][["dim"]],function(x) x$name) # gives the name of each level of the array
start <- c(lon = 428, lat = 176, tstep = 1) # indicates the starting value of each variable
count <- c(lon = 6, lat = 5, tstep = length(ncvar_get(temp_netcdf,"day"))) # indicates how many values of each variable have to be present starting from start
tempstore<-ncvar_get(temp_netcdf, param[t], start = start[dim.order], count = count[dim.order]) # array with parameter values
df_temp<-array2df (tempstore, levels = list(lon=ycoord, lat = xcoord, day = NA), label.x = "value") # convert array to dataframe
Add_date<-sort(as.Date(paste(temp_year[t],"-",temp_month[t],"-",temp_day,sep=""),"%Y-%m-%d"),decreasing=FALSE) # make vector with the dates
list_var[t]<-list(data.frame(Add_date,df_temp,parameter=param[t])) #add dates to data frame and store in a list of all output files
### nc_close(temp_netcdf) #close nc file to prevent data loss and errors
}
All_NetCDF_var_in1df<-do.call(rbind,list_var)
I have thousands of city folders (for example city1, city2, and so on, but in reality named like NewYork, Boston, etc.). Each folder further contains two subfolders: land and house.
So the directory structure is like:
current directory
---- city1
----- house
------ many .xlsx files
----- land
---- city2
---- city3
···
---- city1000
I want to get the complete list of all subdirs and do some manipulation (like import excel). I know there is an extended macro function, local list: dir, to handle this, but it seems it can only return the first tier of subdirs, like city_i, rather than the deeper ones.
More specifically, if I want to take action within all house folders, what kind of workflow do I need?
I have made an initial attempt to write code to achieve my goal:
cd G:\Data_backup\Soufang_data
local folder: dir . dirs "*"
foreach i of local folder {
local `i'_house : dir "G:\Data_backup\Soufang_data\``i''\house" files "*.xlsx"
local count = 1
foreach j of local `i'_house {
cap import excel "`j'",clear
cap sxpose,clear
cap drop in 1/1
if `count'==1 {
save `i'.dta, replace
}
else {
cap qui append using `i'
save `i'.dta,replace
}
local ++count
}
}
There is something wrong with:
``i''
in the dir call; I tried to make it work, without success.
I have another post on this project.
Supplementary remarks:
As Nick points out, it's the backslash that causes the trouble. Moving on from that point, however, I encounter another problem. Say that, leaving out the complicated actions, I just want to test whether my loops work, so I write the following code snippet:
set more off
cd G:\Data_backup\Soufang_data
local folder: dir . dirs "*"
foreach i of local folder {
di "`i'"
local `i'_house : dir "G:\Data_backup\Soufang_data/`i'\house" files "*.xlsx"
foreach j of local `i'_house {
di "`j'"
}
}
However, the outcome on the screen is something like:
city1
project100
project99
······
project1
It seems the code only loops one round, over the first city, but fails to come to city2, city3 and so on. I suspect it's due to my problematic writing of the local, especially in this line but I'm not sure:
foreach j of local `i'_house
Although not a solution to whatever problem you're actually presenting, an easier way might be to use filelist, from SSC (ssc install filelist).
An example might be:
. // list all files
. filelist, directory("D:\Datos\RFERRER\Desktop\example")
Number of files found = 5
.
. // strange way of tagging directories ending in "\house"
. // change at will
. gen tag = substr(reverse(dirname),1,6) == "esuoh/"
.
. order tag
. list
     +----------------------------------------------------------------------------------------------+
     | tag   dirname                                                     filename             fsize |
     |----------------------------------------------------------------------------------------------|
  1. |   0   D:\Datos\RFERRER\Desktop\example/proj_1                     newfile.txt              0 |
  2. |   1   D:\Datos\RFERRER\Desktop\example/proj_2/house               somefile.txt             0 |
  3. |   0   D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2         newfile2.txt             0 |
  4. |   1   D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2/house   anothernewfile.txt       0 |
  5. |   1   D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2/house   someotherfile.txt        0 |
     +----------------------------------------------------------------------------------------------+
Afterwards, use keep or drop, conditional on variable tag.
Graphically, the directory looks like: [screenshot of the directory tree not reproduced]
(I'm on Stata 13. Check help string functions for other ways to tag.)
Your revised problem may yield to
local folder: dir . dirs "*"
foreach i of local folder {
di "`i'"
local house : dir "G:\Data_backup\Soufang_data/`i'\house" files "*.xlsx"
foreach j of local house {
di "`j'"
}
}
but clearly we can't see your file structure or file names.
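The backslash issue mentioned in the supplementary remarks can be seen in isolation. In Stata, a backslash placed directly before a left quote (or a dollar sign) stops the macro from being substituted, so a Windows-style path ending in a backslash just before a macro reference does not expand as intended. A small sketch, using the question's path and a hypothetical folder name city1:

```stata
* Sketch only: city1 is a hypothetical folder name.
local i city1

* backslash directly before the left quote: the macro is NOT substituted
display "G:\Data_backup\Soufang_data\`i'\house"

* forward slashes throughout: the macro is substituted
local good "G:/Data_backup/Soufang_data/`i'/house"
display "`good'"

* backslashes elsewhere are harmless as long as the character
* directly before the macro is a forward slash
local mixed "G:\Data_backup\Soufang_data/`i'\house"
display "`mixed'"
```

Stata accepts forward slashes in paths on Windows, so using them everywhere is the simplest way to avoid the problem.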
I've got a number of files that contain gene expression data. In each file, the gene name is kept in a column "Gene_symbol" and the expression measure (a real number) is kept in a column "RPKM". The file name consists of an identifier followed by _ and the rest of the name (ends with "expression.txt"). I would like to load all of these files into R as data frames, for each data frame rename the column "RPKM" with the identifier of the original file and then join the data frames by "Gene_symbol" into one large data frame with one column "Gene_symbol" followed by all the columns with the expression measures from the individual files, each labeled with the original identifier.
I've managed to transfer the identifier of the original files to the names of the individual data frames as follows.
files <- list.files(pattern = "expression.txt$")
for (i in files) {
  var_name = paste("Data", strsplit(i, "_")[[1]][1], sep = "_")
  assign(var_name, read.table(i, header=TRUE)[,c("Gene_symbol", "RPKM")])
}
So now I'm at a stage where I have dataframes as follows:
Data_id0001 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(2.43,5.24,6.53))
Data_id0002 <- data.frame(Gene_symbol=c("geneA","geneB","geneC"),RPKM=c(4.53,1.07,2.44))
But then I don't seem to be able to rename the RPKM column with the id000x bit. (That is in a fully automated way of course, looping through all the data frames I will generate in the real scenario.)
I've tried to store the identifier bit as a comment with the data frames but seem to be unable to assign the comment from within a loop.
Any help would be appreciated,
mce
You should never work this way in R. You should always try to keep all your data frames in a list and operate over them using functions such as lapply. Thus, instead of using assign, just create an empty list with the same length as your list of files and fill it in the for loop.
For your current situation, we can fix it using a combination of ls and mget to pull these data frames from the global environment into a list and then rename the columns of interest.
temp <- mget(ls(pattern = "Data_id\\d+$"))
lapply(names(temp), function(x) names(temp[[x]])[2] <<- gsub("Data_", "", x))
temp
#$Data_id0001
# Gene_symbol id0001
# 1 geneA 2.43
# 2 geneB 5.24
# 3 geneC 6.53
#
# $Data_id0002
# Gene_symbol id0002
# 1 geneA 4.53
# 2 geneB 1.07
# 3 geneC 2.44
You could eventually use list2env to get them back into the global environment, but you should use it with caution.
Thanks a lot for your suggestions! I think I get the point. The way I'm doing it now (see below) is hopefully a lot more R-like, and it works fine!
Cheers,
Maik
library(plyr)
files <- list.files(pattern = "expression.txt$")
temp <- list()
for (i in 1:length(files)) {temp[[i]]=read.table(files[i], header=TRUE)[,c("Gene_symbol", "RPKM")]}
for (i in 1:length(temp)) {temp[[i]]=rename(temp[[i]], c("RPKM"=strsplit(files[i], "_")[[1]][1]))}
combined_expression <- join_all(temp, by="Gene_symbol", type="full")
I have some survey data which I'm using Stata to analyze. I want to compute means of one variable by group and save those means to a Stata file. My code looks like this:
svyset [iw=wtsupp], sdrweight(repwtp1-repwtp160) vce(sdr)
svy: mean x
I tried
svy: by grp: mean x
but that did not work. I could save each mean to a separate file by simply saying
svy: mean x if grp==1
but that's inefficient. Is there a better way?
Saving results to a file, the way SAS ODS can capture results, is also something I need. I am not talking about the log here: I need the means and the associated group. I'm thinking of
estimates save [path],replace
but I'm not sure whether that will give me a Stata data file, or whether it will include the group even if I can figure out how to use by processing.
Here's a simpler approach that creates a data set of the displayed estimation results: estimated means, standard errors, confidence limits, z statistics, and p-values. svy: mean is called with the over() option, which does away with the need for the foreach loop and computes standard errors appropriate for subpopulation analysis. The estimation results are contained in the returned matrix r(table), which is converted by the svmat command to a Stata data set. While svmat maintains column names, it does not preserve row (group) names, so it is necessary to merge these in to the created data set.
set more off
use http://www.stata-press.com/data/r13/ss07ptx, clear
svyset _n [pw= pwgtp], sdrweight(pwgtp*) vce(sdr)
**************************************************
* Set name of grouping variable in double quotes *
* in the next line.                              *
**************************************************
local gpname "sex"
tempvar gp
egen `gp' = group(`gpname')
preserve
tempfile t1
bys `gp': keep if _n==1
keep `gp' `gpname'
save `t1'
restore
svy: mean agep , over(`gp')
matrix a = r(table)'
clear
qui svmat double a, names(col)
gen `gp'=_n
merge 1:1 `gp' using `t1'
keep `gpname' b se z pvalue ll ul
order `gpname'
save results, replace
list
Edited 10/28
This version contains legibility improvements, and the outcome variable and the saved dataset are specified in local macros, so the analyst need not touch the foreach block. Matrix subscript expressions, which are easier to write and read, are used instead of the el() matrix function: thus m[1,1] instead of el("m",1,1).
sysuse auto, clear
svyset _n
***************************************************
* Set names of grouping variable and results data *
* set in double quotes in the next line.          *
***************************************************
local yvar mpg // variable for mean
local gpname "foreign"
local d_results "results"
tempvar gp
gen `gp' = `gpname'
tempname memhold
postfile `memhold' ///
`gpname' n mean se sd using `d_results', replace
levelsof `gp', local(lg)
foreach x of local lg{
svy, subpop(if `gp'==`x'): mean `yvar'
matrix m = e(b)
matrix v = e(V)
matrix a = e(V_srssub)
matrix b = e(_N_subp)
matrix c = e(_N)
scalar gx = `x'
scalar mean = m[1,1]
scalar sem = sqrt(v[1,1])
scalar sd = sqrt(b[1,1]*a[1,1])
scalar n = c[1,1]
post `memhold' (gx) (n) (mean) (sem) (sd)
}
postclose `memhold'
use results, clear
list
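The remark about matrix subscripts versus el() can be checked on a toy matrix. A minimal sketch, with illustrative matrix and scalar names that do not appear in the answer:

```stata
* Sketch only: the matrix and scalar names are illustrative.
matrix m = (1, 2 \ 3, 4)

* subscript form: row 2, column 1
scalar sub = m[2,1]

* the same element via el(), which takes the matrix name as a string
scalar viael = el("m", 2, 1)

* both scalars hold the element in row 2, column 1, which is 3
display sub " " viael
```

The two forms return the same element; the subscript form is simply shorter to write and read.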
I have 53 Stata .dta files, each of them 150 - 200 Mb, which contain an identical set of variables but for different years. Combining or merging them is not practical because of their size.
I need to retrieve some averaged values (percentages etc.). Therefore, I want to create a new Stata file, New.dta, and write a .do file that would run on that new file in the following way: it should open each of those 53 Stata files, make certain calculations, and store the results in the new Stata file, New.dta.
I am not sure how I can keep two Stata files open simultaneously, and how I can store the calculated values.
When I open a second .dta file, how can I keep the first one open? How can I store the calculated values in a global variable?
What springs to mind here is the use of postfile.
Here is a simple example. First, I set up an example of several datasets. You already have this.
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
save test`i'
clear
}
Now I set up postfile. I need to set up a handle, what variables will be used, and what file will be used. Although I am using a numeric variable to hold file identifiers, it will perhaps be more typical to use a string variable. Also, looping over filenames may be a bit more challenging than this. fs from SSC is a convenience command that helps put a set of filenames into a local macro; its use is not illustrated here.
postfile mypost what mean using alltest.dta
forval i = 1/10 {
use test`i', clear
su foo, meanonly
post mypost (`i') (`r(mean)')
}
Now flush results
postclose mypost
and see what we have.
u alltest
list
+-----------------+
| what mean |
|-----------------|
1. | 1 .5110765 |
2. | 2 1.016858 |
3. | 3 1.425967 |
4. | 4 2.144528 |
5. | 5 2.438035 |
|-----------------|
6. | 6 3.030457 |
7. | 7 3.356905 |
8. | 8 4.449655 |
9. | 9 4.381101 |
10. | 10 5.017308 |
+-----------------+
I didn't use any global macros (not global variables) here; you should not need to.
An alternative approach is to loop over files and use collapse to "condense" these files to the relevant means, and then append these condensed files. Here is an adaptation of Nick's example:
// create the example datasets
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
gen year = `i'
save test`i', replace
clear
}
// use collapse and append
// to create the dataset you want
use test1, clear
collapse (mean) year foo
save means, replace
forvalues i = 2/10 {
use test`i', clear
collapse (mean) year foo
append using means
save means, replace
}
// admire the result
list
Note that if your data sets are not named sequentially like test1.dta, test2.dta, ..., test53.dta, but rather like results-alaska.dta, result_in_alabama.dta, ..., "wyoming data.dta" (note the space and hence the quotes), you would have to organize the cycle over these files somewhat differently:
local allfiles : dir . files "*.dta"
foreach f of local allfiles {
use `"`f'"', clear
* all other code from Maarten's or Nick's approach
}
This is a more advanced use of local macros; see help extended macro functions. Note also that Stata will produce a list that looks like "results-alaska.dta" "result_in_alabama.dta" "wyoming data.dta", with quotes around the file names, so when you invoke use, you will have to enclose the file name in compound quotes.
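Putting the pieces together, the dir loop can feed postfile directly, with the file name stored in a string variable as the identifier. The following is a sketch under stated assumptions rather than a definitive implementation: the file names results_*.dta, the variable foo, the output file allmeans.dta and the str40 width are all made up for the example.

```stata
* Sketch only: file names, variable names and the str40 width are
* assumptions made for this example.
clear
set seed 12345

* create three small example datasets whose names are not sequential
foreach s in alaska alabama wyoming {
    clear
    set obs 100
    gen foo = runiform()
    save "results_`s'.dta", replace
}

* collect the file names, then post one row of results per file
local allfiles : dir . files "results_*.dta"
tempname h
postfile `h' str40 fname double mean using "allmeans.dta", replace
foreach f of local allfiles {
    use `"`f'"', clear
    summarize foo, meanonly
    post `h' ("`f'") (r(mean))
}
postclose `h'

* inspect the collected means
use "allmeans.dta", clear
list
```

The results file is deliberately given a name that does not match the results_*.dta pattern, so that rerunning the do-file does not pick it up as an input.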