Import file from a folder based on a regular expression - stata

I am working with DHS data, which involves various data files with a consistent naming located in different folders. Each folder contains data for a specific country and survey year.
I would like to import datasets whose names contain the component 'HR'; for example, ETHR41FL.DTA. The 'HR' part is consistent, but the other components of the name vary by country and survey year. I need to work with one dataset at a time and then move on to the next, so I believe an automated search would be helpful.
Running the command below gives:
dir "*.dta"
42.6M 5/17/07 10:49 ETBR41FL.dta
19.4M 7/17/06 12:32 ETHR41FL.DTA
60.5M 7/17/06 12:33 ETIR41FL.DTA
10.6M 7/17/06 12:33 ETKR41FL.DTA
234.4k 4/05/07 12:36 ETWI41FL.DTA
I have tried the following approach, which did not work as desired and may not be the best or most direct one:
local datafiles : dir . files "*.dta" //store file names in a macro
di `datafiles'
etbr41fl.dtaethr41fl.dtaetir41fl.dtaetkr41fl.dtaetwi41fl.dta
The next step, I think, would be to store the value of the macro datafiles above in a variable (since strupper() seems to work with variables but not macros), then convert it to uppercase and extract the string ETHR41FL.dta. However, I encounter a problem when I do this:
local datafiles : dir . files "*.dta" //store file names in a macro
gen datafiles= `datafiles'
invalid '"ethr41fl.dta'
If I try the command below it works but gives a variable of empty values:
local datafiles : dir . files "*.dta" //store file names in a macro
gen datafiles= "`datafiles'"
How can I store the components of datafiles into a new variable?
If this works I could then extract the required string using a regular expression and import the dataset:
gen targetfile= regexs(0) if(regexm(`datafiles', "[A-Z][A-Z][H][R][0-9][0-9][A-Z][A-Z]"))
However, I would also appreciate a different approach.

Following Nick's advice to continue working with local macros rather than putting filenames into Stata variables, here is some technique to accomplish your stated objective. I agree with Nick about ignoring the capitalization of the filenames provided by Windows, which uses a case-insensitive filesystem. My example will work on case-sensitive filesystems, but will match any upper-, lower-, or mixed-case filenames.
. dir *.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 a space.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etbr41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 ethr41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etir41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etkr41fl.dta
-rw-r--r-- 1 lisowskiw staff 1199 Jan 18 10:04 etwi41fl.dta
. local datafiles : dir . files "*.dta"
. di `"`datafiles'"'
"a space.dta" "etbr41fl.dta" "ethr41fl.dta" "etir41fl.dta" "etkr41fl.dta" "etwi41fl.dta"
. foreach file of local datafiles {
2. display "`file' testing"
3. if regexm(upper("`file'"),"[A-Z][A-Z][H][R][0-9][0-9][A-Z][A-Z]") {
4. display "`file' matched!"
5. // process file here
. }
6. }
a space.dta testing
etbr41fl.dta testing
ethr41fl.dta testing
ethr41fl.dta matched!
etir41fl.dta testing
etkr41fl.dta testing
etwi41fl.dta testing
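For readers comparing across languages, the same case-insensitive filtering logic can be sketched in Python. This is only an illustration of the pattern's behavior, not part of the Stata workflow; the file names below are the hypothetical ones from the listing above.

```python
import re

# DHS-style household-recode pattern: two letters, literal 'HR', two
# digits, two more letters (e.g. ETHR41FL). As in the Stata loop,
# the name is uppercased first so the match is case-insensitive.
PATTERN = re.compile(r"[A-Z][A-Z]HR[0-9][0-9][A-Z][A-Z]")

def matching_files(names):
    """Return the names whose uppercased form contains the HR pattern."""
    return [n for n in names if PATTERN.search(n.upper())]

files = ["a space.dta", "etbr41fl.dta", "ethr41fl.dta", "etir41fl.dta"]
print(matching_files(files))  # ['ethr41fl.dta']
```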

You can use filelist (from SSC) to create a dataset of file names. You can then leverage the full set of Stata data management tools to identify the file you want to target. To install filelist, type in Stata's command window:
ssc install filelist
Here's a quick example with datasets that follow the example provided:
. filelist, norecur
Number of files found = 6
. list if strpos(upper(filename),".DTA")
+---------------------------------+
| dirname filename fsize |
|---------------------------------|
1. | . ETBR41FL.dta 12,207 |
2. | . ETHR41FL.DTA 12,207 |
3. | . ETIR41FL.DTA 12,207 |
4. | . ETKR41FL.DTA 12,207 |
5. | . ETWI41FL.DTA 12,207 |
+---------------------------------+
. keep if regexm(upper(filename), "[A-Z][A-Z][H][R][0-9][0-9][A-Z][A-Z]")
(5 observations deleted)
. list
+---------------------------------+
| dirname filename fsize |
|---------------------------------|
1. | . ETHR41FL.DTA 12,207 |
+---------------------------------+
.
. * with only one observation in memory, use immediate macro expansion
. * to form the file name to read in memory
. use "`=filename'", clear
(1978 Automobile Data)
. describe, short
Contains data from ETHR41FL.DTA
obs: 74 1978 Automobile Data
vars: 12 18 Jan 2016 11:58
size: 3,182
Sorted by: foreign

I find the question very puzzling as it is about extracting a particular filename; but if you know the filename you want, you can just type it directly. You may need to revise your question if the point is different.
However, let's discuss some technique.
Putting file names inside Stata variables (meaning, strictly, columns in the dataset) is possible in principle, but it is only rarely the best idea. You should keep going in the direction you started, namely defining and then manipulating local macros.
In this case the variable element can be extracted by inspection, but let's show how to remove some common elements:
. local names etbr41fl.dta ethr41fl.dta etir41fl.dta etkr41fl.dta etwi41fl.dta
. local names : subinstr local names ".dta" "", all
. local names : subinstr local names "et" "", all
. di "`names'"
br41fl hr41fl ir41fl kr41fl wi41fl
That's enough to show more technique, which is that you can loop over such names. In fact, with the construct you illustrate you can do that anyway, and neither regular expressions nor anything else is needed:
. local datafiles : dir . files "*.dta"
. foreach f of local datafiles {
... using "`f'"
}
. foreach n of local names {
... using "et`n'.dta"
}
The examples here show a detail when giving literal strings, namely that " " are often needed as delimiters (and rarely harmful).
Note. Upper case and lower case in file names is probably irrelevant here. Stata will translate.
Note. You say that
. gen datafiles = "`datafiles'"
gives empty values. That's likely to be because you executed that statement in a context where the local macro was invisible. Common examples are: executing one command from a do-file editor window and another from the main Command window; or executing commands one by one from a do-file editor window. That's why local macros are so named; they are only visible within the same block of code.

In this particular case you do not really need to use a regular expression.
The strmatch() function will do the job equally well:
local datafiles etbr41fl.dta ethr41fl.dta etir41fl.dta etkr41fl.dta etwi41fl.dta
foreach x of local datafiles {
if strmatch(upper("`x'"), "*HR*") display "`x'"
}
ethr41fl.dta
The use of the upper() function is optional.
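Python's fnmatch offers glob-style matching analogous to strmatch(), which may help readers comparing the two approaches. A small sketch, using the hypothetical file names from the question:

```python
from fnmatch import fnmatch

datafiles = ["etbr41fl.dta", "ethr41fl.dta", "etir41fl.dta", "etwi41fl.dta"]

# strmatch(upper("`x'"), "*HR*") translates to a glob test on the
# uppercased name; fnmatch uses the same *-wildcard semantics.
matches = [f for f in datafiles if fnmatch(f.upper(), "*HR*")]
print(matches)  # ['ethr41fl.dta']
```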

Related

Using dictionary in regexp_replace function in pyspark

I want to perform a regexp_replace operation on a PySpark dataframe column using a dictionary.
Dictionary: {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key-value pairs.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on the internet. Trying to pass the dictionary as in the code below
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
throws the error "regexp_replace takes 3 arguments, 2 given".
The dataset will be around 20 million rows. Hence, a UDF solution will be slow (due to the row-wise operation), and we don't have access to Spark 2.3.0, which supports pandas_udf.
Is there any efficient method of doing it other than may be using a loop?
It is throwing this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp work and a lookup table that looks exactly like your original dictionary :)
Here is my solution for this:
import pyspark.sql.functions as sf

# First, get rid of all the endings you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate building that pattern from the dictionary
# keys, but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace("original_address", "RD|DR|etc...", ""))

# You will still need the old ending in a separate column;
# this way you have something to join on your lookup table.
input_df = input_df.withColumn('end_of_address', sf.regexp_extract('original_address', "(.*) (.*)", 2))

# Now join the lookup table, which has two columns: the endings you want
# to replace and the endings you want to have instead.
input_df = directory_df.join(input_df, 'end_of_address')

# Finally, concatenate the address stem with the correct ending.
input_df = input_df.withColumn('address_clean', sf.concat('start_address', 'correct_end'))
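Outside Spark, the dictionary-driven replacement itself can be sketched in plain Python with one compiled alternation and a callback. This is a minimal sketch with a hypothetical three-entry subset of the ~270-pair mapping, assuming whole-word matches are intended:

```python
import re

# Hypothetical subset of the full mapping
mapping = {"RD": "ROAD", "DR": "DRIVE", "AVE": "AVENUE"}

# One alternation of all keys, anchored on word boundaries so that,
# e.g., the "DR" inside "MULLOHAND" is not rewritten.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")

def clean(address):
    """Replace each matched abbreviation via a dictionary lookup."""
    return pattern.sub(lambda m: mapping[m.group(1)], address)

print(clean("33, PARK AVE MULLOHAND DR"))  # 33, PARK AVENUE MULLOHAND DRIVE
```

The same idea scales to Spark via regexp_extract/join as in the answer above, or to a pandas_udf where available.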

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution, and since I have no clue why the solution only works for the two sample cases listed in the description link, I'd like to ask anyone for comments and hints on how to correct my code. Thank you very much for your time and assistance.
import sys

def football(string):
    countries = map(str.split, string.split('|'))
    teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
    results = []
    for i in range(len(teams)):
        results.append([teams[i] + ':'])
        for j in range(len(countries)):
            if teams[i] in countries[j]:
                results[i].append(str(j + 1))
    for i in range(len(results)):
        results[i] = results[i][0] + ','.join(results[i][1:])
    return '; '.join(results) + '; '

if __name__ == '__main__':
    lines = [line.rstrip() for line in open(sys.argv[1])]
    for line in lines:
        print football(line)
After deliberately failing an attempt so I could check the complete test input against my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted in ascending numeric order, which the code above does not achieve because the team numbers are strings, not integers. Therefore the solution is simply to add a key so the teams list is sorted in ascending integer order:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better codes or great suggestions on improving my programming skills.
Here's how I solved it:
import sys

with open(sys.argv[1]) as test_cases:
    for test in test_cases:
        if test:
            team_supporters = {}
            for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
                for team in map(int, nation_teams.split()):
                    team_supporters.setdefault(team, []).append(nation)
            print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
                    for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
We build the output using str.format calls and the default space separator of the print function. If the final semicolon weren't desired, I'd have used print("; ".join("{}:{}".format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even necessary, since the nations are processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.
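The core reversal step can be isolated into a tiny function to see the mechanics on their own (the function name here is mine, not part of the original solution):

```python
def invert(test):
    """Map each team to the list of nations (1-based) that support it."""
    team_supporters = {}
    for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
        for team in map(int, nation_teams.split()):
            team_supporters.setdefault(team, []).append(nation)
    return team_supporters

print(invert("1 2 3 4 | 3 1 | 4 1"))
# {1: [1, 2, 3], 2: [1], 3: [1, 2], 4: [1, 3]}
```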

How to acquire complete list of subdirs (including subdirs of subdirs)?

I have thousands of city folders (for example city1, city2, and so on, but in reality named like NewYork, Boston, etc.). Each folder further contains two subfolders: land and house.
So the directory structure is like:
current directory
---- city1
     ---- house
          ---- many .xlsx files
     ---- land
---- city2
---- city3
...
---- city1000
I want to get the complete list of all subdirs and do some manipulation (like import excel). I know there is an extended macro function, local list : dir, to handle this issue, but it seems it can only return the first tier of subdirs, like the city folders, rather than deeper ones.
More specifically, if I want to take action within all house folders, what kind of workflow do I need?
I have made an initial attempt to write code to achieve my goal:
cd G:\Data_backup\Soufang_data
local folder: dir . dirs "*"
foreach i of local folder {
    local `i'_house : dir "G:\Data_backup\Soufang_data\``i''\house" files "*.xlsx"
    local count = 1
    foreach j of local `i'_house {
        cap import excel "`j'", clear
        cap sxpose, clear
        cap drop in 1/1
        if `count' == 1 {
            save `i'.dta, replace
        }
        else {
            cap qui append using `i'
            save `i'.dta, replace
        }
        local ++count
    }
}
There is something wrong with
``i''
in the dir call; I struggled to make it work, without success.
I have another post on this project.
Supplementary remarks:
As Nick points out, it's the back slash that causes the trouble. Moving from that point, however, I encounter another problem. Say, without the complicated actions, I just want to test if my loops work, so I write the following code snippet:
set more off
cd G:\Data_backup\Soufang_data
local folder: dir . dirs "*"
foreach i of local folder {
    di "`i'"
    local `i'_house : dir "G:\Data_backup\Soufang_data/`i'\house" files "*.xlsx"
    foreach j of local `i'_house {
        di "`j'"
    }
}
However, the outcome on the screen is something like:
city1
project100
project99
...
project1
It seems the code only loops one round, over the first city, and fails to reach city2, city3, and so on. I suspect it's due to my problematic writing of the local, especially in this line, but I'm not sure:
foreach j of local `i'_house
Although not a solution to whatever problem you're actually presenting, an easier way might be to use filelist, from SSC (ssc install filelist).
An example might be:
. // list all files
. filelist, directory("D:\Datos\RFERRER\Desktop\example")
Number of files found = 5
.
. // strange way of tagging directories ending in "\house"
. // change at will
. gen tag = substr(reverse(dirname),1,6) == "esuoh/"
.
. order tag
. list
+----------------------------------------------------------------------------------------------+
| tag dirname filename fsize |
|----------------------------------------------------------------------------------------------|
1. | 0 D:\Datos\RFERRER\Desktop\example/proj_1 newfile.txt 0 |
2. | 1 D:\Datos\RFERRER\Desktop\example/proj_2/house somefile.txt 0 |
3. | 0 D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2 newfile2.txt 0 |
4. | 1 D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2/house anothernewfile.txt 0 |
5. | 1 D:\Datos\RFERRER\Desktop\example/proj_3/subproj_3_2/house someotherfile.txt 0 |
+----------------------------------------------------------------------------------------------+
Afterwards, use keep or drop, conditional on variable tag.
Graphically, the directory looks like:
(I'm on Stata 13. Check help string functions for other ways to tag.)
Your revised problem may yield to
local folder: dir . dirs "*"
foreach i of local folder {
    di "`i'"
    local house : dir "G:\Data_backup\Soufang_data/`i'\house" files "*.xlsx"
    foreach j of local house {
        di "`j'"
    }
}
but clearly we can't see your file structure or file names.
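For readers outside Stata, the "all subdirectories, however deep" requirement is exactly what a recursive walk provides. A sketch in Python, where the function name is mine and the city/house layout mirrors the question:

```python
import os

def house_xlsx_files(root):
    """Walk root recursively; collect .xlsx paths inside any 'house' folder."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "house":
            found.extend(os.path.join(dirpath, f)
                         for f in filenames if f.endswith(".xlsx"))
    return sorted(found)
```

Each returned path can then be fed to an importer one file at a time, which is the workflow the question describes.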

How to Regex in a script to gzip log files

I would like to gzip log files but I cannot work out how to run a regex expression in my command.
My log files look like this; they roll every hour.
-rw-r--r-- 1 aus nds 191353 Sep 28 01:59 fubar.log.20150928-01
-rw-r--r-- 1 aus nds 191058 Sep 28 02:59 fubar.log.20150928-02
-rw-r--r-- 1 aus nds 190991 Sep 28 03:59 fubar.log.20150928-03
-rw-r--r-- 1 aus nds 191388 Sep 28 04:59 fubar.log.20150928-04
Script:
FUBAR_DATE=$(date -d "days ago" +"%Y%m%d ")
fubar_file="/apps/fubar/logs/fubar.log."$AUS_DATE"-^[0-9]"
/bin/gzip $fubar_file
I have tried a few variants of the regex without success. Can you see the simple error in my code?
Thanks in advance.
I did:
$ fubar_file="./fubar.log."${FUBAR_DATE%% }"-[0-9][0-9]"
and it worked for me.
Why not make fubar_file an array to hold the matching log file names, and then use a loop to gzip them individually. Then presuming AUS_DATE contains 20150928:
# FUBAR_DATE=$(date -d "days ago" +"%Y%m%d ") # not needed for gzip
fubar_file=( /apps/fubar/logs/fubar.log.$AUS_DATE-[0-9][0-9] )
for i in "${fubar_file[@]}"; do
    gzip "$i"
done
or if you do not need to preserve the filenames in the array for later use, just gzip the files with a for loop:
for i in /apps/fubar/logs/fubar.log.$AUS_DATE-[0-9][0-9]; do
    gzip "$i"
done
or, simply use find to match the files and gzip them:
find /apps/fubar/logs -type f -name "fubar.log.$AUS_DATE-[0-9][0-9]" -execdir gzip '{}' +
Note: all answers presume AUS_DATE contains 20150928.
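One detail worth guarding against with the array approach: if no log matches, the unexpanded glob itself would be passed to gzip. A nullglob-protected sketch, with a placeholder date and the current directory standing in for /apps/fubar/logs:

```shell
#!/bin/bash
# With nullglob set, a pattern that matches nothing expands to an empty
# array instead of the literal pattern, so gzip is never given a bogus name.
shopt -s nullglob
AUS_DATE=20150928                        # placeholder date
logs=( ./fubar.log.$AUS_DATE-[0-9][0-9] )
for f in "${logs[@]}"; do
    gzip "$f"
done
```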

How to retrieve data from multiple Stata files?

I have 53 Stata .dta files, each of them 150-200 MB in size and containing an identical set of variables, but for different years. It is not practical to combine or merge them due to their size.
I need to retrieve some averaged values (percentages, etc.). Therefore, I want to create a new Stata file, New.dta, and write a .do file that would build it in the following way: open each of those 53 Stata files, make certain calculations, and store the results in the new Stata file, New.dta.
I am not sure how I can keep two Stata files open simultaneously, and how I can store the calculated values.
When I open a second .dta file, how can I make the first one stay open? How can I store the calculated values in a global variable?
What springs to mind here is the use of postfile.
Here is a simple example. First, I set up an example of several datasets. You already have this.
clear
forval i = 1/10 {
    set obs 100
    gen foo = `i' * runiform()
    save test`i'
    clear
}
Now I set up postfile. I need to set up a handle, what variables will be used, and what file will be used. Although I am using a numeric variable to hold file identifiers, it will perhaps be more typical to use a string variable. Also, looping over filenames may be a bit more challenging than this. fs from SSC is a convenience command that helps put a set of filenames into a local macro; its use is not illustrated here.
postfile mypost what mean using alltest.dta

forval i = 1/10 {
    use test`i', clear
    su foo, meanonly
    post mypost (`i') (`r(mean)')
}
Now flush results
postclose mypost
and see what we have.
u alltest
list
+-----------------+
| what mean |
|-----------------|
1. | 1 .5110765 |
2. | 2 1.016858 |
3. | 3 1.425967 |
4. | 4 2.144528 |
5. | 5 2.438035 |
|-----------------|
6. | 6 3.030457 |
7. | 7 3.356905 |
8. | 8 4.449655 |
9. | 9 4.381101 |
10. | 10 5.017308 |
+-----------------+
I didn't use any global macros (not global variables) here; you should not need to.
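The postfile pattern — declare the result columns once, loop over the files, post one row per file, then close the results file — has a direct analog in most languages. A plain-Python sketch over hypothetical CSV files, purely to illustrate the shape of the workflow:

```python
import csv
import glob
import statistics

def collect_means(pattern, outpath):
    """Post one (file, mean-of-foo) row per matching CSV into outpath."""
    with open(outpath, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["what", "mean"])          # declare the 'posted' columns
        for path in sorted(glob.glob(pattern)):    # loop over the input files
            with open(path, newline="") as f:
                values = [float(row["foo"]) for row in csv.DictReader(f)]
            writer.writerow([path, statistics.mean(values)])
```

The with-block plays the role of postfile/postclose: the results file is opened once, receives one row per input file, and is flushed when the loop ends.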
An alternative approach is to loop over the files and use collapse to "condense" each file to the relevant means, and then append these condensed files. Here is an adaptation of Nick's example:
// create the example datasets
clear
forval i = 1/10 {
    set obs 100
    gen foo = `i' * runiform()
    gen year = `i'
    save test`i', replace
    clear
}

// use collapse and append
// to create the dataset you want
use test1, clear
collapse (mean) year foo
save means, replace
forvalues i = 2/10 {
    use test`i', clear
    collapse (mean) year foo
    append using means
    save means, replace
}

// admire the result
list
// admire the result
list
Note that if your data sets are not named sequentially like test1.dta, test2.dta, ..., test53.dta, but rather like results-alaska.dta, result_in_alabama.dta, ..., "wyoming data.dta" (note the space and hence the quotes), you would have to organize the cycle over these files somewhat differently:
local allfiles : dir . files "*.dta"
foreach f of local allfiles {
    use `"`f'"', clear
    * all other code from Maarten's or Nick's approach
}
This is a more advanced use of local macros; see help extended macro functions. Note also that Stata will produce a list that looks like "results-alaska.dta" "result_in_alabama.dta" "wyoming data.dta", with quotes around the file names, so when you invoke use you will have to enclose the file name in compound quotes.