How do I loop through file names in Stata?

1) Is it possible to create a vector of strings in Stata? 2) If so, is it then possible to loop through the elements of this vector, performing commands on each element?
To create a single string in Stata, I know you do this:
local x = "a string"
But I have about 200 data files I need to loop through, and they are not conveniently named with consecutive suffixes like "_2000" "_2001" "_2002" etc. In fact there is no rhyme or reason to the file names, but I do have a list of them which I could easily cut and paste into a string vector, and then call the elements of this vector one by one, as one might do in MATLAB.
Is there a way to do this in Stata?

On top of Keith's answer: you can also get the list of files in a directory with
local myfilelist : dir . files "*.dta"
or more generally
local theirfilelist : dir <directory name> files <file mask>
See help extended_fcn.

Sure -- You just create a list using a typical local call. If you don't put quotes around the whole thing your lists can be really long.
local mylist aaa bbb "cc c" dd ee ff
Then you just use foreach.
foreach filename of local mylist {
use `"`filename'"'
}
The compound double quotes (`" "') are used because one of the filenames contains a space and is therefore wrapped in quotes inside the list. Writing foreach filename of local mylist is also a touch faster than putting foreach filename in `mylist' { on the first line.
If you want to manipulate your list, see help macrolists.
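As a rough illustration of what help macrolists covers, here is a minimal sketch that removes duplicates from a list, subtracts one list from another, and counts the result. The file names and the local names (all, done, todo) are made up for the example.

```stata
* Two hypothetical file lists (names are made up for illustration)
local all  aaa bbb ccc ddd bbb
local done bbb ddd

* Remove duplicates, then drop the files that were already processed
local all  : list uniq all
local todo : list all - done

* Count what is left and look up the position of one element
local n   : list sizeof todo
local pos : list posof "ccc" in todo

display "`n' files left: `todo' (ccc is element `pos')"
```

With these inputs, todo should end up holding aaa and ccc, and it can be passed straight to foreach filename of local todo.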
Related questions have been asked more than once on Stack Overflow:
In Stata how do you assign a long list of variable names to a local macro?
Equivalent function of R's "%in%" for Stata

Many people might want a combination of the two, as I did. Here it is:
* Create a local containing the list of files.
local myfilelist : dir "." files "*.dta"
* Or manually create the list by typing in the filenames.
* Compound double quotes keep the inner quotes around each filename intact.
local myfilelist `""file1.dta" "file2.dta" "file3.dta""'
* Then loop through them as you need.
foreach filename of local myfilelist {
use "`filename'"
}
I hope that helps. Note that macros are limited in length (67,784 characters in some flavors and versions of Stata; see help limits for yours), so watch out for this when you have a really long list of files or really long filenames.
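To see how close a file list comes to the limit, something like the following sketch may help. It assumes the .dta files are in the current directory and reads the limit for the running copy of Stata from c(macrolen). The local name used is only illustrative.

```stata
* Maximum macro length for the running Stata flavor and version
display c(macrolen)

* Length, in characters, of a file list built with the dir extended function
local myfilelist : dir "." files "*.dta"
local used : length local myfilelist
display "the list uses `used' of `c(macrolen)' available characters"
```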

Related

Splitting long file path in Stata

Assume that I have a long file path (80+ characters) from my current working folder:
use .\random_folders_name\project1\secret_data\survey_data\big_constructed_file.dta
I am looking for a way to split it across two lines to comply with an 80-character line standard.
I've tried
use .\random_folders_name\project1\secret_data\survey_data///
\big_constructed_file.dta
and
use ".\random_folders_name\project1\secret_data\survey_data"///
+ "\big_constructed_file.dta"
without success.
I would prefer not to change the working directory, as that would make it necessary to change it back.

+ can be used for string concatenation but only within an expression to be evaluated.
This works
clear
set obs 1
gen whatever = "a" + "b"
and this works
local whatever = "a" + "b"
di "`whatever'"
Putting one or more parts of a string in a local macro is one way to do what you want and what I would recommend if writing within 80 characters on a line.
local dir ".\random_folders_name\project1\secret_data\survey_data\"
use "`dir'big_constructed_file.dta"
You could do this:
local name = ".\random_folders_name\project1\secret_data\survey_data" + ///
"\big_constructed_file.dta"
use "`name'"
That's the closest I could get to taking your approach and making it work.
On backslashes, watch out: http://www.stata-journal.com/sjpdf.html?articlenum=pr0042
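A short sketch of the backslash problem that the linked note describes: a backslash placed directly before a macro reference escapes it, so the macro is not expanded. The folder and file names below are hypothetical.

```stata
local sub "survey_data"

* Risky: the backslash right before the macro reference escapes it,
* so the macro is not expanded
display ".\project1\`sub'\file.dta"

* Safer: forward slashes are accepted in Stata paths, also on Windows
display "./project1/`sub'/file.dta"
```

Using forward slashes throughout, or ending a directory macro with a slash as in the first suggestion above, avoids the issue.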

How do I loop over part of a variable name?

I need to use a local macro to loop over part of a variable name in Stata.
Here is what I tried to do:
local phth mep mibp mbp
tab lod_`phth'_BL
Stata will not recognize the entire variable name.
variable lod_mep not found
r(111);
If I remove the underscore after the `phth' it still does not recognize anything after the macro name.
I want to avoid using a complicated foreach loop.
Is there any way this can be done just using the simple macro?
Thanks!

Your request is a bit confusing. First, this is precisely the purpose of a loop, and second, loops in Stata are (at the "introductory level") quite simple. The following example is a bit nonsensical (and given the structure, there are easier ways of going about this), but should convey the basic idea.
// set up a similar variable name structure
sysuse auto , clear
rename (price mpg weight length) ///
(pref_base1_suff pref_base2_suff pref_base3_suff pref_base4_suff)
// define a local macro to hold the elements to loop over
local varbases = "base1 base2 base3 base4"
// refer to the items of the local macro in a loop
foreach b of local varbases {
summ pref_`b'_suff
}
See help foreach for the syntax of foreach. In particular, note that the structure employed above may not even be required due to Stata's varlist structure (see help varlist). For example, continuing with the code above:
foreach v of varlist pref_base?_suff {
summ `v'
}
The wildcard ? takes the place of one character. * could be used for more flexibility. However, if your variables are not as easily identifiable using the pattern matching allowed by varlist, a loop as in the first example is simple enough -- four very short lines of code.
Postscript
Upon further reflection (sometimes the structure of the question anchors a certain method, when an alternative approach is more straightforward), searching the help files for information on the tabulate command (help tabulate) will direct you to the following syntax: tab1 varlist [if] [in] [weight] [, tab1_options]
Given the discussion above about the use of varlists, you can simply code
tab1 lod_m*_BL
assuming, of course, that there are no other variables matching the pattern for which you do not want to report a frequency table. Alternatively,
tab1 lod_mep_BL lod_mibp_BL lod_mbp_BL
is not much longer and does the trick, albeit without the use of any sort of wildcard or macro substitution.
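For completeness, the original attempt fails because `phth' expands to the whole list at once, so Stata sees tab lod_mep mibp mbp_BL and stops at lod_mep. A minimal loop over the same list, using toy variables created only for this sketch, might look like this:

```stata
* Toy data with the variable names from the question
clear
set obs 10
foreach p in mep mibp mbp {
    generate lod_`p'_BL = mod(_n, 2)
}

* Loop over the list so that one element at a time is substituted
local phth mep mibp mbp
foreach p of local phth {
    tabulate lod_`p'_BL
}
```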

How to use regexp to find unique combinations of letters and use them as variables in Matlab?

I have the file names of four files stored in a cell array called F2000. These files are named:
L14N_2009_2000MHZ.txt
L8N_2009_2000MHZ.txt
L14N_2010_2000MHZ.txt
L8N_2009_2000MHZ.txt
Each file consists of an mxn matrix where m is the same but n varies from file to file. I'd like to store each of the L14N files and each of the L8N files in two separate cell arrays so I can use dlmread in a for loop to store each text file as a matrix in an element of the cell array. To do this, I wrote the following code:
idx2009=cellfun('isempty',regexp(F2000,'L\d{1,2}N_2009_2000MHZ.txt'));
F2000_2009=F2000(idx2009);
idx2010=~idx2009;
F2000_2010=F2000(idx2010);
cell2009=cell(size(F2000_2009));
cell2010=cell(size(F2000_2010));
for k = 1:numel(F2000_2009)
cell2009{k}=dlmread(F2000_2009{k});
end
and repeated a similar "for" loop to use on F2000_2010. So far so good. However.
My real data set is much larger than just four files. The total number of files will vary, although I know there will be five years of data for each L\d{1,2}N (so, for instance, L8N_2009, L8N_2010, L8N_2011, L8N_2012, L8N_2013). I won't know what the number of files is ahead of time (although I do know it will range between 50 and 100), and I won't know what the file names are, but they will always be in the same L\d{1,2}N format.
In addition to what's already working, I want to count the number of files that have unique combinations of numbers in the portion of the filename that says L\d{1,2}N so I can further break down F2000_2010 and F2000_2009 in the above example to F2000_2010_L8N and F2000_2009_L8N before I start the dlmread loop.
Can I use regexp to build a list of all of my unique L\d{1,2}N occurrences? Next, can I easily change these list elements to strings to parse the original file names and create a new file name to the effect of L14N_2009, where 14 comes from \d{1,2}? I am sure this is a beginner question, but I discovered regexp yesterday! Any help is much appreciated!

Here is some code which might help:
% Find all the files in your directory
files = dir('*2000MHZ.txt');
files = {files.name};
% match identifiers
ids = unique(cellfun(@(x)x{1},regexp(files,'L\d{1,2}N','match'),...
'UniformOutput',false));
% find all years
years = unique(cellfun(@(x)x{1},regexp(files,'(?<=L\d{1,2}N_)\d{4,}','match'),...
'UniformOutput',false));
% find the years for each identifier
for id_ix = 1:length(ids)
% There is probably a better way to do this
list = regexp(files,['(?<=' ids{id_ix} '_)\d{4,}'],'match');
ids_years{id_ix} = cellfun(@(x)x{1},list(cellfun(...
@(x)~isempty(x),list)),'uniformoutput',false);
end
% If you need dynamic naming, I would suggest dynamic struct names:
for ix_id = 1:length(ids)
for ix_year = 1:length(ids_years{ix_id})
% the 'Y' is in the dynamic name because all struct field names must start with a letter
data.(ids{ix_id}).(['Y' ids_years{ix_id}{ix_year}]) =...
'read in my data here for each one';
end
end
Also, if anyone is interested in mapping keys to values, try looking into the containers.Map class.

Stata: Appending multiple files and extracting variables from file names

I have 114 files with the .dat extension to convert to Stata/SE and append, each with a substantial number of variables (varying from 81 to 16800). I have reset the maximum number of variables to 32000 (set maxvar 32000), increased the memory (set mem 500m), and I was using the following algorithm to combine a large number of files and to generate several variables by extracting parts of the file names: http://www.ats.ucla.edu/stat/stata/faq/append_many_files.htm
The code looks as follows:
cd "C:\Users\..."
! dir *.dat /a-d /b >d:\Stata_directory\Products_batchfilelist.txt
file open myfile using "d:\Stata_directory\Products_batchfilelist.txt", read
file read myfile line
drop _all
insheet using `line', comma names
gen n = substr("`line'",10,1)
gen m = substr("`line'",12,1)
gen playersnum = substr("`line'",14,1)
save Products_merged.dta, replace
drop _all
file read myfile line
while r(eof)==0 {
insheet using `line', comma names
gen n = substr("`line'",10,1)
gen m = substr("`line'",12,1)
generate playersnum = substr("`line'",14,1)
save `line'.dta, replace
append using Products_merged.dta
save Products_merged.dta,replace
drop _all
file read myfile line
}
The problem is that although the variables n, m, and playersnum extracted from the file names are present in each individual file, they disappear in the final "Products_merged.dta" file. Could anyone tell me what the problem could be and whether it is possible to solve it with Stata/SE?

I don't see an obvious problem with the code that would be causing this. It may have something to do with the limits in SE, but that is still unlikely in my mind (you would see an error if a command does something to exceed maxvar).
My only suggestion would be to put a couple commands inside the append loop that will help you debug:
save `line'.dta, replace
append using Products_merged.dta
assert m!="" & n!="" & playersnum!=""
save Products_merged.dta,replace
This will do two things: ensure your variables exist after each new append (your first-order concern), and check that they are never blank (not your stated concern but a good check anyway).
If you post a couple of the files I could probably give a better answer.
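In the meantime, here is a sketch of a more compact version of the same loop, which may make the problem easier to isolate. It is not a diagnosis. It assumes the .dat files sit in the current directory and that characters 10, 12 and 14 of each file name carry n, m and playersnum, as in the question. The local names are illustrative.

```stata
* Sketch only: build the file list inside Stata instead of shelling out
local files : dir "." files "*.dat"
tempfile merged
local first = 1

foreach f of local files {
    insheet using "`f'", comma names clear
    generate n          = substr("`f'", 10, 1)
    generate m          = substr("`f'", 12, 1)
    generate playersnum = substr("`f'", 14, 1)

    * From the second file on, stack the running result under the new file
    if !`first' {
        append using "`merged'"
    }
    save "`merged'", replace
    local first = 0
}

* Confirm the three variables survived, then write the combined file
confirm variable n m playersnum
save Products_merged.dta, replace
```

On Windows the dir extended function returns lowercase file names unless the respectcase option is added, which matters only if the extracted characters are letters.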

transfer values from one variable to another in Stata

I have a problem at work: I have merged two datasets, and there are a number of variables that have the same content, but where an observation that has a value in the variable from dataset 1 has a missing value in the variable from dataset 2. So I need to transfer the values from one variable into the other.
This is my best shot so far:
replace V23=1 if V232==1
replace V23=2 if V232==2
replace V23=3 if V232==3
replace V23=4 if V232==4
replace V23=8 if V232==8
replace V23=.u if V232==10 | V232==9
However, it is tedious to do that for 40+ variables, and since some of them are numerical variables, it becomes a Sisyphean task.

Here's a start:
foreach v of varlist V23 {
local w `v'2
replace `v' = `w' if missing(`v')
replace `v' = .u if `w' == 10 | `w' == 9
}
Notice how this solution relies on a lexical relationship among the variable names: it assumes the old variable "V23" is associated with the new variable "V232" (variable names in Stata are case-sensitive). You can make a list of such associations and use it, but this is inconvenient. It's probably easier to rename the variables, if necessary, to conform to such a convention, then run the replacement script, and then restore the desired names.
If you're unfamiliar with this kind of automation, read the help pages for macro and foreach.
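If renaming is not an option, the list-of-associations route mentioned above could be sketched with two parallel macros. The toy data, the second variable pair (Q7 and Q7_wave2) and the local names target and source are invented for the example. The recoding of 9 and 10 to .u follows the question.

```stata
* Toy data: two pairs of variables, one pair without a lexical relationship
clear
input V23 V232 Q7 Q7_wave2
1  .  .  2
.  2  3  .
.  9  .  10
4  4  1  1
end

* Parallel lists: the i-th target is filled from the i-th source
local target V23  Q7
local source V232 Q7_wave2
local n : word count `target'

forvalues i = 1/`n' {
    local v : word `i' of `target'
    local w : word `i' of `source'
    replace `v' = `w' if missing(`v')
    replace `v' = .u if inlist(`w', 9, 10)
}
list
```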
