I have 114 files with the .dat extension to convert to Stata/SE and append, each with a substantial number of variables (varying from 81 to 16,800). I have raised the maximum number of variables to 32,000 (set maxvar 32000), increased the memory (set mem 500m), and was using the following algorithm for combining a large number of files and generating several variables by extracting parts of the file names: http://www.ats.ucla.edu/stat/stata/faq/append_many_files.htm
The code looks as follows:
cd "C:\Users\..."
! dir *.dat /a-d /b >d:\Stata_directory\Products_batchfilelist.txt
file open myfile using "d:\Stata_directory\Products_batchfilelist.txt", read
file read myfile line
drop _all
insheet using `line', comma names
gen n = substr("`line'",10,1)
gen m = substr("`line'",12,1)
gen playersnum = substr("`line'",14,1)
save Products_merged.dta, replace
drop _all
file read myfile line
while r(eof)==0 {
    insheet using `line', comma names
    gen n = substr("`line'",10,1)
    gen m = substr("`line'",12,1)
    gen playersnum = substr("`line'",14,1)
    save `line'.dta, replace
    append using Products_merged.dta
    save Products_merged.dta, replace
    drop _all
    file read myfile line
}
The problem is that although the variables n, m, and playersnum extracted from the file names are present in each individual file, they disappear in the final "Products_merged.dta" file. Could anyone tell me what the problem could be, and whether it is possible to solve it with Stata/SE?
I don't see an obvious problem with the code that would cause this. It may have something to do with the limits in SE, but even that seems unlikely to me (you would see an error if a command did something to exceed maxvar).
My only suggestion is to put a couple of commands inside the append loop to help you debug:
save `line'.dta, replace
append using Products_merged.dta
assert m!="" & n!="" & playersnum!=""
save Products_merged.dta, replace
This will do two things: ensure your variables exist after each new append (your first-order concern), and check that they are never blank (not your stated concern but a good check anyway).
If you post a couple of the files I could probably give a better answer.
So, I need to write a script that makes the program look into each folder, open the .txt file, read a number, and store it in memory. I also need it to loop, so the script can run over multiple directories and find the number in each file. That way, with the numbers, I can make a bar graph. I am confused about opening multiple files and storing the numbers in memory.
Here are a few lines to get you started:
import glob

data = {}
filespec = r"E:\data\*\*.txt"
for filename in glob.iglob(filespec):
    with open(filename) as textfile:
        for line in textfile:
            # Keep the part after the colon as this file's number
            if line.startswith("This is the number you want:"):
                data[filename] = line.split(":")[1].strip()
                break
for filename, number in data.items():
    print(filename, number)
Now, I don't really think your text files have lines in them that say
This is the number you want: 42
but you haven't given us much to go on about what they do look like. And I also don't think your files reside in a folder called E:\data. You will have to edit both of those lines yourself before the code will do anything.
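If you then want the bar graph you mentioned, here's a minimal sketch, assuming matplotlib is installed and that the captured values parse as integers (both assumptions, since we haven't seen your files):
import matplotlib.pyplot as plt

# data maps filename -> number string, as built above
names = list(data.keys())
values = [int(v) for v in data.values()]  # assumes each value parses as an integer

plt.bar(range(len(values)), values)
plt.xticks(range(len(values)), names, rotation=90)
plt.tight_layout()
plt.show()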
I need to read multiple raw text files into a SAS dataset. Each file (a dish) lists all its ingredients on one line, separated by commas, as shown in the example files below. The number of ingredients varies. Some example files (dishes):
Example file 1 (dish1.csv):
Tomate, Cheese, Ham, Bread
Example file 2 (dish2.csv):
Sugar, Apple
Example file 3 (dish3.csv):
Milk, Sugar, Cacao
Because I have about 250 files (dishes), I created a macro program to read them; that way I can execute this macro from another macro to read all the dishes I need. The program looks like this:
%macro readDish(dishNumber);
data newDish;
    * Find and read the csv file;
    infile "my_file_location/dish&dishNumber..csv" dlm="," missover;
    * Read up to 25 ingredients;
    input ingredient1-ingredient25 : $25.;
    * Put all ingredients in an array;
    array ingredients{25} ingredient1-ingredient25;
    * Loop through all the ingredients and output;
    do i=1 to dim(ingredients);
        dishNumber = &dishNumber;
        ingredient = ingredients{i};
        output;
    end;
run;
%mend;
Is it possible to create a SAS (macro) program that can read all the dishes, no matter how many ingredients each has? The SAS table should look like this:
1 Tomate
1 Cheese
1 Ham
1 Bread
Seems straightforward to me: read the data in vertically, then if you need it horizontal, add a transpose step afterwards. You don't have to read in a whole line in one step - the @@ operator tells SAS to keep the line pointer on that line, so you just read in the one value.
data dishes;
    length _file $1024
           ingredient $128;
    infile "c:\temp\dish*.csv" dlm=',' filename=_file lrecl=32767; * or whatever your LRECL needs to be;
    input ingredient $ @@;
    dishnumber = input(compress(scan(_file,-2,'\.'),,'kd'),12.);
    output;
run;
Here I use a wildcard to read them all in - you can of course use a macro with similar code if you need to, though a wildcard or a concatenated filename is probably easier. The way I get dishnumber might not always work depending on the filename construction, but some form of it should be usable: for c:\temp\dish12.csv, scan(_file,-2,'\.') pulls out the second-to-last token, dish12, and compress(...,,'kd') keeps only the digits, leaving 12.
To expand on why this works: the SAS data step is a constant loop, executing its code repeatedly until it hits an "end condition". The most common end conditions are the stop keyword and any attempt to read from a SET or INFILE where no further read is possible (i.e., you read a 100-line SAS dataset, it tries to read row 101, fails, and so ends the data step). Other than that, it keeps running the same code; it just does some cleanup at the "run" point to make sure it is not looping infinitely.
In the case of input from infiles, SAS usually reads a line and then, at the RUN, skips forward to the next EOL (end of line, usually a carriage return plus line feed on Windows) if it is not already at one. Sometimes that is useful - perhaps usually. But in some cases you would rather ask SAS to keep reading the same line.
In comes the @@ operator. @@ says "do not advance to the EOL even when you hit RUN". (@ says "do not advance to the EOL except when you hit RUN" - normally input itself causes SAS to read through to the EOL.) Thus, on the next data step iteration, the input pointer is in the exact place you left it - right after the previous field you read in.
This was highly useful in the 60s and 70s, when punch cards were the trendy new thing and input was often laid out without regard to line organization - in particular, by packing many values onto each card instead of one variable per row (at 8 columns per value, one value per card wastes 72 of its 80 columns). So you would have input just like your ingredients: many pieces of data per row on the input, which then need to become one observation per row in memory. While it is less common to store data this way nowadays, it is certainly possible - as your data demonstrate.
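Here's a self-contained illustration of the difference, using datalines instead of an external file (the numbers are made up):
data long;
    input x @@;   * @@ holds the line across data step iterations;
datalines;
1 2 3 4 5 6
;
run;
One input line becomes six observations here; drop the @@ and you would get a single observation (x=1), because SAS would advance past the rest of the line after the first iteration.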
I've got some files I'm processing with a batch file that loops through everything in a directory and dumps certain data into a SQL table. I'm adding a time stamp that I pass into a variable and insert into the SQL table using sqlcmd. The only problem is that, to fill all the relevant columns for each entry, I need to pass the names of the files being added to the SQL table.
Okay, here's the catch... the names being added to the SQL table aren't the actual file names, but database names that can be found in each of these XML files (close enough to XML). I know where that is, and every single one looks something like abcdir (rest of the name), where abcdir is a string that starts every single database name.
So I thought I could use the findstr command to get the database name, but I have very little experience with regex, and I'd like to be able to parse out the tags and be left with just name=abcdir (rest of the name).
(I didn't think any of my code would really be necessary since I'm just asking about a particular command, but if that's not the case, let me know and I'll post it.)
EDIT: Okay, so each file will have something like this if opened in Notepad:
<Name>ABCDir Sample Name</Name>
or
<Name>ABCDir Sample Name2</Name>
and I'd like ABCDir Sample Name to be passed to a batch variable, so I thought to use findstr.
I have very little grasp of regex, but I've tried using findstr >ABCDir[A-Za-z] \path\filename.ext
As I commented above, findstr (or find) will let you scrape lines containing <Name> from a text file, and for /f "delims=<>" will let you split those lines into substrings. With findstr /n, you're looking for "tokens=3 delims=<>" to get the string between <Name> and </Name>.
Try this:
@echo off
setlocal
set "file=temp.txt"
for /f "tokens=3 delims=<>" %%I in ('findstr /n /i "<Name>" "%file%"') do (
    echo %%I
)
I'm using /n with findstr to insert line numbers. The numbers aren't needed, but the switch ensures there is always a token before <Name>, so the string you want is always token 3 regardless of whether the line is indented. Otherwise, your string could be token 3 if indented or token 2 if not, and forcing a line number in front is easier than trying to determine whether the tags are indented.
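To make the tokenizing concrete (the line number 12 is made up): findstr /n turns an indented match into
12:    <Name>ABCDir Sample Name</Name>
and with delims=<> the pieces are 12: plus the indentation (token 1), Name (token 2), ABCDir Sample Name (token 3), and /Name (token 4) - so tokens=3 always lands on the value.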
I need to do quite a few regular expression search/replaces throughout hundreds and hundreds of static files. I'm looking to build an audit trail so I at least know what files were touched by what searches/replaces.
I can do my regular expression searches in Notepad++, and it gives me the file names/paths and the number of hits in each file. It also gives me the line #s, which I don't really care that much about.
What I really want is a separate text file of the file names/paths. The # of hits in each file would be a nice addition, but really it's just the list of file names/paths that I'm after.
In Notepad++'s search results pane, I can right-click and copy, but that includes all the line #s and code, which is just too much noise, especially when you're getting hundreds of matches.
Anyone know how I can get these results to just the file name/paths? I'm after something like:
/about/foo.html
/about/bar.html
/faq/2012/awesome.html
/faq/2013/awesomer.html
/foo/bar/baz/wee.html
etc.
Then I can name that file regex_whatever_search.txt and at the top of it include the regex used for the search and replace. Below that, I've got my list of files it touched.
UPDATE: What looks like the easiest thing to do (at least that I've found) is to copy all the search results into a new text file and run the following regex:
^\tLine.+$
And replace that with an empty string. That'll give you just the file path and hit counts with a lot of empty space between each entry. Then run the following regex:
\s+\n
And replace with:
\n
That'll strip out all the unwanted empty space and you'll be left with a nice list.
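For instance, if the copied results look roughly like this (the exact layout varies by Notepad++ version, so treat this sample as illustrative):
Search "foo" (3 hits in 2 files)
  C:\site\about\foo.html (2 hits)
	Line 10: <div>foo</div>
	Line 22: foo bar
  C:\site\faq\2012\awesome.html (1 hit)
	Line 5: foo
then the first replace deletes the tab-indented Line ... rows, and the second collapses the leftover blank lines, leaving just the header and the file paths with their hit counts.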
Maybe you need the power of Unix tools.
Assume you have GnuWin32 installed in c:\tools\gnuwin32.
Then, if you have a replace.bat file with this content:
@echo off
set BIN=c:\tools\gnuwin32\bin
set WHAT=%1
set TOWHAT=%2
set MASK=%3
rem Removing quotes
SET WHAT=###%WHAT%###
SET WHAT=%WHAT:"###=%
SET WHAT=%WHAT:###"=%
SET WHAT=%WHAT:###=%
SET TOWHAT=###%TOWHAT%###
SET TOWHAT=%TOWHAT:"###=%
SET TOWHAT=%TOWHAT:###"=%
SET TOWHAT=%TOWHAT:###=%
SET MASK=###%MASK%###
SET MASK=%MASK:"###=%
SET MASK=%MASK:###"=%
SET MASK=%MASK:###=%
echo %WHAT% replaces to %TOWHAT%
rem printing matching files
%BIN%\grep -r -c "%WHAT%" %MASK%
rem actual replace
%BIN%\find %MASK% -type f -exec %BIN%\sed -i "s/%WHAT%/%TOWHAT%/g" {} +
you can do a regex replace in masked files recursively, with the output you required:
replace "using System.Windows" "using Nothing" *.cs
The regular expression I use for this kind of problem is
^\tLine.[0-9]*:.
and it works for me.
This works well if you have Excel available and want to avoid using regular expressions:
Ctrl+A to select all the results
drag & drop the selected results to Excel
Create a Filter on the 1st row
Filter out the lines that have "(Blank)" in the 1st column
Select the remaining lines (i.e. the lines with the filenames) and copy/paste them to another sheet or any wanted destination
You could also Ctrl+A, Ctrl+C the search results, then use the paste option "Use Text Import Wizard" in Excel, say that the data is "Fixed width", put a single break line after the 2nd character (to remove the two leading spaces from the filenames during import), and use a filter to filter out the unwanted rows.
1) Is it possible to create a vector of strings in Stata? 2) If yes, is it then possible to loop through the elements of this vector, performing commands on each element?
To create a single string in Stata, I know you do this:
local x = "a string"
But I have about 200 data files I need to loop through, and they are not conveniently named with consecutive suffixes like "_2000", "_2001", "_2002", etc. In fact there is no rhyme or reason to the file names, but I do have a list of them, which I could easily cut and paste into a string vector and then call the elements of this vector one by one, as one might do in MATLAB.
Is there a way to do this in Stata?
On top of Keith's answer: you can also get the list of files in a directory with
local myfilelist : dir . files "*.dta"
or more generally
local theirfilelist : dir <directory name> files <file mask>
See help extended_fcn.
Sure -- you just create a list using a typical local call. If you don't put quotes around the whole thing, your lists can be really long.
local mylist aaa bbb "cc c" dd ee ff
Then you just use foreach.
foreach filename of local mylist {
    use `"`filename'"'
}
The compound double quotes (`" "') are used because one of the filenames contains a space and is therefore itself quoted. This is also a touch faster than putting foreach filename in `mylist' { on the first line.
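To see it in action, here's a toy run (display stands in for use, so you can try it without any files):
local mylist aaa bbb "cc c"
foreach filename of local mylist {
    display `"`filename'"'
}
The loop prints aaa, bbb, and cc c as three elements. The compound quotes are the safe general form: they still work if an element itself contains double-quote characters.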
If you want to manipulate your list, see help macrolists.
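For example, help macrolists documents set-style operations on such lists; a quick sketch with toy lists:
local a "x y z"
local b "y z w"
local u : list a | b
local k : list a & b
display "`u'"    // union: x y z w
display "`k'"    // intersection: y z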
Related questions have been asked more than once on Stack Overflow:
In Stata how do you assign a long list of variable names to a local macro?
Equivalent function of R's "%in%" for Stata
What many people might want is the combination of the two, as I did. Here it is:
* Create a local containing the list of files.
local myfilelist : dir "." files "*.dta"
* Or manually create the list by typing in the filenames.
local myfilelist "file1.dta" "file2.dta" "file3.dta"
* Then loop through them as you need.
foreach filename of local myfilelist {
    use "`filename'"
}
I hope that helps. Note that locals/macros are limited to 67,784 characters; watch out for this when you have a really long list of files or really long filenames.
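If you're worried about hitting that limit, you can check both it and your list directly (a quick sketch; myfilelist is the local built above):
display c(macrolen)                  // the macro length limit in your flavor of Stata
local nfiles : list sizeof myfilelist
display `nfiles'                     // number of filenames collected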