How to retrieve data from multiple Stata files? - stata

I have 53 Stata .dta files each of them is 150 - 200 Mb and contain identical set of variables, but for different years. It is not useful to combine or merge them due to their size .
I need to retrieve some averaged values (percentages etc.) Therefore, I want to create a new Stata file New.dta and write a .do file that would run on that new Stata file in the following way: it should open each of those 53 Stata files, make certain calulations, and store the results in the new Stata file, New.dta.
I am not sure how i can keep two Stata file open simultaneuosly, and how can i store the calculated values?
When I open a second .dta file, how can i make the first one still be open? How can i store the calculated values in the global variable?

What springs to mind here is the use of postfile.
Here is a simple example. First, I set up an example of several datasets. You already have this.
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
save test`i'
clear
}
Now I set up postfile. I need to set up a handle, what variables will be used, and what file will be used. Although I am using a numeric variable to hold file identifiers, it will perhaps be more typical to use a string variable. Also, looping over filenames may be a bit more challenging than this. fs from SSC is a convenience command that helps put a set of filenames into a local macro; its use is not illustrated here.
postfile mypost what mean using alltest.dta
forval i = 1/10 {
use test`i', clear
su foo, meanonly
post mypost (`i') (`r(mean)')
}
Now flush results
postclose mypost
and see what we have.
u alltest
list
+-----------------+
| what mean |
|-----------------|
1. | 1 .5110765 |
2. | 2 1.016858 |
3. | 3 1.425967 |
4. | 4 2.144528 |
5. | 5 2.438035 |
|-----------------|
6. | 6 3.030457 |
7. | 7 3.356905 |
8. | 8 4.449655 |
9. | 9 4.381101 |
10. | 10 5.017308 |
+-----------------+
I didn't use any global macros (not global variables) here; you should not need to.

An alternative approach is to loop over files and use collapse to "condense" these files to the relevant means, and than append these condensed files. Here is an adaptation of Nick's example:
// create the example datasets
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
gen year = `i'
save test`i', replace
clear
}
// use collapse and append
// to create the dataset you want
use test1, clear
collapse (mean) year foo
save means, replace
forvalues i = 2/10 {
use test`i', clear
collapse (mean) year foo
append using means
save means, replace
}
// admire the result
list

Note that if your data sets are not named sequentially like test1.dta, test2.dta, ..., test53.dta, but rather like results-alaska.dta, result_in_alabama.dta, ..., "wyoming data.dta" (note the space and hence the quotes), you would have to organize the cycle over these files somewhat differently:
local allfiles : dir . files "*.dta"
foreach f of local allfiles {
use `"`f'"', clear
* all other code from Maarten's or Nick's approach
}
This is a more advanced of local macros, see help extended macro functions. Note also that Stata will produce a list that will look like "results-alaska.dta" "result_in_alabama.dta" "wyoming data.dta" with quotes around file names, so when you invoke use, you will have to enclose the file name into compound quotes.

Related

Using dictionary in regexp_replace function in pyspark

I want to perform an regexp_replace operation on a pyspark dataframe column using dictionary.
Dictionary : {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key value pair.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on internet. And if trying to pass dictionary as below codes-
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
Throws an error "regexp_replace takes 3 arguments, 2 given".
Dataset will be around 20 million in size. Hence, UDF solution will be slow (due to row wise operation) and we don't have access to spark 2.3.0 which supports pandas_udf.
Is there any efficient method of doing it other than may be using a loop?
It is trowing you this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp and a directory table that looks exactly like your original directory :)
Here is my solution for this:
# You need to get rid of all the things you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate that and pass it a string that looks like that instead but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace("original_address","RD|DR|etc...",""))
# You will still need the old ends in a separate column
# This way you have something to join on your directory table.
input_df = input_df.withColumn('end_of_address',sf.regexp_extract('original_address',"(.*) (.*)", 2))
# Now we join the directory table that has two columns - ends you want to replace and ends you want to have instead.
input_df = directory_df.join(input_df,'end_of_address')
# And now you just need to concatenate the address with the correct ending.
input_df = input_df.withColumn('address_clean',sf.concat('start_address','correct_end'))

How to assign the maximum amount of strings to macro automatically?

My question's title may be a little bit ambiguous.
Previously, I wanted to "acquire complete list of subdirs" and then read the files in these subdirs into Stata (see this post and this post).
Thanks to #Roberto Ferrer's great suggestion, I almost manage to do this. But I encountered another problem then. Because I have so many separate files, the length of local macro seems to hit its upper bound. After the command local n: word count Stata sends an error message:
macro substitution results in line that is too long.
The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters, which is calculated on the basis of set maxvar. You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP.
The maximum line length is defined as 16 more than the maximum macro length, which is currently 645,200 characters. Each unit increase in set maxvar increases the length maximums by 129.The maximum value of set maxvar is 32,767. Thus, the maximum line length may be set up to 4,227,159 characters if you set maxvar to its largest value.
r(920);
When I reduce the number of subdirs to 5, Stata works fine. Since having roughly 100 subdirs, I suppose to replicate the actions for 20 times. Well, it's manageable, but I still want to know if I can fully automate this process , more specifically, to "exhaust" the max allowable macro length,import the files and add another group of subdirs next time .
Below you can find my code:
//====================================
//=== read and clean projects data ===
//====================================
version 14
set linesize 80
set more off
clear
macro drop _all
set linesize 200
cd G:\Data_backup\Soufang_data
*----------------------------------
* Read all files within dictionary
*----------------------------------
* Import the first worksheets 1:"项目首页" 2:"项目概况" 3:"成交详情"
* worksheet1
filelist, directory("G:\Data_backup\Soufang_data") pattern(*.xlsx)
* Add pattern(*.xlsx) provent importing add file type( .doc or .dta)
gen tag = substr(reverse(dirname),1,6) == "esuoh/"
keep if tag==1
gen path = dirname+"\"+filename
qui valuesof path if tag==1
local filelist = r(values)
split dirname, parse("\" "/")
ren dirname4 citylist
drop dirname1-dirname3 dirname5
qui valuesof citylist if tag==1
local city = r(values)
local count = 1
local n:word count `filelist'
forval i = 1/`n' {
local file : word `i' of `filelist'
local cityname: word `i' of `city'
** don't add xlsx after `file', suffix has been added
** write "`file'" rather than `file', I don't know why but it works
qui import excel using "`file'",clear
cap qui sxpose,clear
cap qui drop in 1/1
gen city = "`cityname'"
if `count'==1 {
save house.dta,replace emptyok
}
else {
qui append using house
qui save house.dta,replace emptyok
}
local ++count
}
Thank you.
You do not need to store the whole list of files in a macro. filelist creates a database of files that you want to work with. Just save it and reload it for each file you want to process. You also use a very inefficient way to append datasets. As the appended dataset grows, the cost of reloading and saving it become very high and can slow down the whole process to a crawl.
Here's a sketch of how to process your Excel files
filelist, directory(".") pattern(*.xlsx)
save "myfiles.dta", replace
local n = _N
forval i = 1/`n' {
use in `i' using "myfiles.dta", clear
local f = dirname + "/" + filename
qui import excel using "`f'",clear
tempfile res`i'
save "`res`i''"
}
clear
forval i = 1/`n' {
append using "`res`i''"
}
save "final.dta", replace

How to acquire complete list of subdirs (including subdirs of subdirs)?

I have thousands of city folders (for example city1, city2, and so on, but in reality named like NewYork, Boston, etc.). Each folder further contains two subfolders: land and house.
So the directory structure is like:
current dictionary
---- city1
----- house
------ many .xlsx files
----- land
----- city2
----- city3
···
----- city1000
I want to get the complete list of all subdirs and do some manipulation (like import excel). I know there is a macro extended function: local list: dir to handle this issue, but it seems it can only return the first tier of subdirs, like city_i, rather than those deeper ones.
More specifically, if I want to take action within all house folders, what kind of workflow do I need?
I have made an initial attempt to write code to achieve my goal:
cd G:\Data_backup\Soufang_data
local folder: dir . dirs "*"
foreach i of local folder {
local `i'_house : dir "G:\Data_backup\Soufang_data\``i''\house" files "*.xlsx"
local count = 1
foreach j of local `i'_house {
cap import excel "`j'",clear
cap sxpose,clear
cap drop in 1/1
if `count'==1 {
save `i'.dta, replace
}
else {
cap qui append using `i'
save `i'.dta,replace
}
local ++count
}
}
There is something wrong with:
``i''
in the dir, I struggled to make it work without success, anyway.
I have another post on this project.
Supplementary remarks:
As Nick points out, it's the back slash that causes the trouble. Moving from that point, however, I encounter another problem. Say, without the complicated actions, I just want to test if my loops work, so I write the following code snippet:
set more off
cd G:\Data_backup\Soufang_data
local folder: dir . dirs "*"
foreach i of local folder {
di "`i'"
local `i'_house : dir "G:\Data_backup\Soufang_data/`i'\house" files "*.xlsx"
foreach j of local `i'_house {
di "`j'"
}
}
However, the outcome on the screen is something like:
city1
project100
project99
······
project1
It seems the code only loops one round, over the first city, but fails to come to city2, city3 and so on. I suspect it's due to my problematic writing of the local, especially in this line but I'm not sure:
foreach j of local `i'_house
Although not a solution to whatever problem you're actually presenting, an easier way might be to use filelist, from SSC (ssc install filelist).
An example might be:
. // list all files
. filelist, directory("D:\Datos\RFERRER\Desktop\example")
Number of files found = 5
.
. // strange way of tagging directories ending in "\house"
. // change at will
. gen tag = substr(reverse(dirname),1,6) == "esuoh/"
.
. order tag
. list
+----------------------------------------------------------------------------------------------+
| tag dirname filename fsize |
|----------------------------------------------------------------------------------------------|
1. | 0 D:\Datos\RFERRER\Desktop\example/proj_1 newfile.txt 0 |
2. | 1 D:\Datos\RFERRER\Desktop\example/proj_2/house somefile.txt 0 |
3. | 0 D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2 newfile2.txt 0 |
4. | 1 D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2/house anothernewfile.txt 0 |
5. | 1 D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2/house someotherfile.txt 0 |
+----------------------------------------------------------------------------------------------+
Afterwards, use keep or drop, conditional on variable tag.
Graphically, the directory looks like:
(I'm on Stata 13. Check help string functions for other ways to tag.)
Your revised problem may yield to
local folder: dir . dirs "*"
foreach i of local folder {
di "`i'"
local house : dir "G:\Data_backup\Soufang_data/`i'\house" files "*.xlsx"
foreach j of local house {
di "`j'"
}
}
but clearly we can't see your file structure or file names.

Equivalent of SQL LIKE operator in R

In an R script, I have a function that creates a data frame of files in a directory that have a specific extension.
The dataframe is always two columns with however many rows as there are files found with that specific extension.
The data frame ends up looking something like this:
| Path | Filename |
|:------------------------:|:-----------:|
| C:/Path/to/the/file1.ext | file1.ext |
| C:/Path/to/the/file2.ext | file2.ext |
| C:/Path/to/the/file3.ext | file3.ext |
| C:/Path/to/the/file4.ext | file4.ext |
Forgive the archaeic way that I express this question. I know that in SQL, you can apply where functions with like instead of =. So I could say `where Filename like '%1%' and it would pull out all files with a 1 in the name. Is there a way use something like this to set a variable in R?
I have a couple of different scripts that need to use the Filename pulled from this dataframe. The only reliable way I can think to tell the script which one to pull from is to set a variable like this.
Ultimately I would like these two (pseudo)expressions to yield the same thing.
x <- file1.ext
and
x like '%1%'
should both give x = file1.ext
you can use grepl() as in this answer
subset(a, grepl("1", a$filename))
Or if you're coming from an SQL background, you might want to look into sqldf
you can use like from data.table to get your sql like behaviour here.
From the documentation see this example
library(data.table)
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
DT[Name %like% "^Mar"]
for your problem suppose you have a data.frame df like this
path filename
1: C:/Path/to/the/file1.ext file1.ext
2: C:/Path/to/the/file2.ext file2.ext
3: C:/Path/to/the/file3.ext file3.ext
4: C:/Path/to/the/file4.ext file4.ext
do
library(data.table)
DT<-as.data.table(df)
DT[filename %like% "1"]
should give
path filename
1: C:/Path/to/the/file1.ext file1.ext

search for specific characters within column and then create different columns from it

I have param_Value column that have different values. I need to extract these values and create columns for all of them.
|PARAM_NAME |param_Value |
__________|____________
|Step 4 | SP:0.09 |
|Procedure | MAX:125 |
|Step 4 | SP:Ambient|
|(null) | +/-:N/A |
|Steam | SP:2 |
|Step 3 | MIN:0 |
|Step 4 | RDPHN427B |
|Testing De | N/A |
I only want columns with: And give them names:
SP: SET_POINT_VALUE,
MAX: MAX_LIMIT,
MIN: MIN_LIMIT,
+/-: UPPER_LOWER_LIMIT
So what I have so far is:
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME,
REGEXP_LIKE("param_Value", 'SP:') SET_POINT_VALUE,
REGEXP_LIKE("param_Value", '+/-:') UPPER_LOWER_LIMIT,
REGEXP_LIKE("param_Value", 'MAX:') MAX_VALUE,
REGEXP_LIKE("param_Value", 'MIN:') MIN_VALUE
FROM PROCESS_STEPS
;
I'm more familiar with TSQL and MySQL, but this ought to do what I think you're looking for. If it doesn't exactly, it should at least point you in the right direction.
CREATE OR REPLACE FORCE VIEW PROCESS_STEPS
("PARAM_NAME", "SET_POINT_VALUE", "UPPER_LOWER_LIMIT", "MAX_VALUE", "MIN_VALUE")
AS
SELECT PARAM_NAME
, CASE WHEN "param_Value" LIKE 'SP:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END SET_POINT_VALUE
, CASE WHEN "param_Value" LIKE '+/-:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END UPPER_LOWER_LIMIT
, CASE WHEN "param_Value" LIKE 'MAX:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MAX_VALUE
, CASE WHEN "param_Value" LIKE 'MIN:%'
THEN SUBSTR("param_Value", INSTR("param_Value", ':')+1)
ELSE Null
END MIN_VALUE
FROM PROCESS_STEPS
;
The basic concept here is identifying the information you want via LIKE, then using SUBSTR and INSTR to extract it. While LIKE is normally something to stay away from, since there's no leading % in your case, it's Sargable, and thus probably not a total efficiency sink.
Really, though, I have to ask you to question why you're laying out your data like this - substring operations are slow in any language, and a DB is no exception. Why not use another column for your limit type? Why not lay it out in the view you're currently looking at?