How to clean cells using SAS regex

I have a table
id Attribute Other
1 Written Jan 20 File: 78yt8fgkje ....
2 12/22/2004 File: 3Bsdffsdf85 ....
3 12/17/2004 File: 5Osdfdsf58384 ....
4 Some May File: 0w98ejcj ....
5 10/24/2001 File: 2Ddsfsdfd1429 ....
....................
I need to remove everything that comes after the word File: in the Attribute variable.
How can I do this?
I tried this solution from the internet. It does not work, and I do not understand what 32767 is.
data newDataSet;
set oldDataSet;
regex1 = prxparse("/ File:.*? /");
call prxchange(rx1, 32767, Attribute);
run;

PRX is probably overkill for this.
data want;
set have;
filepos = find(attribute,'File:');
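if filepos>0 then attribute=substr(attribute,1,filepos+4);
run;
filepos+4 is to keep "File:" itself, as you say "after": find returns the position of the "F", and "File:" is five characters long. If you want to get rid of "File:" as well, use filepos-1 instead of filepos+4 (this assumes "File:" never starts at position 1, since substr needs a positive length).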
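The index arithmetic here is easy to get off by one, so it may help to see the same idea outside SAS. The following is only a sketch in Python, not SAS code: the function name truncate_at_marker and the keep_marker flag are made up for illustration, and Python counts positions from 0 whereas SAS counts from 1.

```python
MARKER = "File:"

def truncate_at_marker(text: str, keep_marker: bool = True) -> str:
    """Cut `text` at the first MARKER; return it unchanged if MARKER is absent."""
    pos = text.find(MARKER)  # 0-based index of the "F", or -1 when not found
    if pos < 0:
        return text
    # Keep the marker: stop just past its last character.
    # Drop the marker: stop just before its first character.
    end = pos + len(MARKER) if keep_marker else pos
    return text[:end].rstrip()

rows = [
    "Written Jan 20 File: 78yt8fgkje",
    "12/22/2004 File: 3Bsdffsdf85",
    "no marker here",
]
for row in rows:
    print(truncate_at_marker(row), "|", truncate_at_marker(row, keep_marker=False))
```

Using the marker's length instead of a hard-coded number is what avoids the off-by-one question altogether.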


How to format a variable in Stata?

I am trying to grab the formatted current date and create a variable from it using the command:
gen %tdCY-N-D final_dayinpt = date(c(current_date), "DMY")
However, I am getting an error
%tdCY invalid name
r(198);
If I display this at the Stata command line it works:
. display %tdCY-N-D date(c(current_date), "DMY")
2020-10-27
How can I create this formatted variable?
Solution:
set obs 10 // so the example works
generate final_dayinpt = date(c(current_date), "DMY")
format final_dayinpt %tdCY-N-D
The syntax you're trying is tempting because the format you use for display seems analogous to things like generate byte bytevar = 1, but, as you found, the analogy doesn't hold here.
Note that you are passing format information where type is expected based on the generate syntax (help generate):
generate [type] newvar[:lblname] =exp [if] [in] [, before(varname) | after(varname)]
While consulting help generate and help display is helpful, help datetime is also very useful here (and was never obvious to me).
See here for a more thorough treatment of working with dates in Stata.
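The underlying point is that a Stata daily date is just a number (days since 01jan1960) and the format only controls how that number is shown. A rough Python illustration of that separation, not Stata code; the helper names to_stata_daily and format_daily are hypothetical:

```python
from datetime import date, timedelta

STATA_EPOCH = date(1960, 1, 1)  # Stata daily dates count days from this day

def to_stata_daily(d: date) -> int:
    """The number a Stata daily date variable would store for `d`."""
    return (d - STATA_EPOCH).days

def format_daily(n: int) -> str:
    """Render a stored day count as year-month-day, similar to %tdCY-N-D."""
    return (STATA_EPOCH + timedelta(days=n)).strftime("%Y-%m-%d")

stored = to_stata_daily(date(2020, 10, 27))
print(stored)                # the value held in the variable
print(format_daily(stored))  # how a display format would show it
```

This is why generate only needs to produce the number, and format is applied afterwards as a separate step.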
Edit:
An alternative suggested by (and made possible by) Nick Cox:
ssc install numdate // install the package which Nick wrote
generate current_date = c(current_date) // numdate takes a varlist
numdate daily final_dayinpt = current_date, pattern(DMY) format(%tdCY-N-D)

SAS Programming: Create New Column Based on Suffix of Existing Column

I have a sas dataset called list that contains all files/path/filename of a directory.
sample dataset
I want to create a new column by adding 1 to the suffix of column the_name, so 01 will become 02 and 02 will become 03.
For example:
the_name: FOR_PROCESSING_1234562020042002
new_name: FOR_PROCESSING_1234562020042003
the_name: FOR_PROCESSING_1234562020042101
new_name: FOR_PROCESSING_1234562020042102
Thanks for your help.
Recy:
A safer increment would be to scan the entire number off the end of the_name, and not rely on incrementing just a single tail-end digit.
data _null_;
the_name = 'FOR_PROCESSING_1234562020042002';
suffix = scan(the_name,-1,'_');
nextnum = input(suffix,best20.)+1;
new_name = cats(transtrn(the_name,trim(suffix),''),nextnum);
put the_name= / new_name= ;
run;
--- LOG ---
the_name=FOR_PROCESSING_1234562020042002
new_name=FOR_PROCESSING_1234562020042003
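For comparison, the same "take the whole trailing number, add one, put it back" idea can be sketched in Python. This is an illustration only, not a translation of the SAS step: bump_suffix is a made-up name, and the zero-padding to the original width is an extra assumption that matters only if a suffix can start with 0.

```python
def bump_suffix(name: str, sep: str = "_") -> str:
    """Add 1 to the numeric part after the last `sep`, keeping its width."""
    head, found, suffix = name.rpartition(sep)
    if not found or not suffix.isdigit():
        raise ValueError(f"no numeric suffix after {sep!r} in {name!r}")
    # Python integers do not overflow, so the 16-digit suffix is safe here.
    bumped = str(int(suffix) + 1).zfill(len(suffix))
    return f"{head}{sep}{bumped}"

for the_name in ("FOR_PROCESSING_1234562020042002",
                 "FOR_PROCESSING_1234562020042101"):
    print(the_name, "->", bump_suffix(the_name))
```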

How to assign the maximum amount of strings to macro automatically?

My question's title may be a little bit ambiguous.
Previously, I wanted to "acquire complete list of subdirs" and then read the files in these subdirs into Stata (see this post and this post).
Thanks to @Roberto Ferrer's great suggestion, I almost managed to do this, but then I ran into another problem. Because I have so many separate files, the local macro seems to hit its maximum length. After the command local n: word count, Stata returns an error message:
macro substitution results in line that is too long.
The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters, which is calculated on the basis of set maxvar. You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP.
The maximum line length is defined as 16 more than the maximum macro length, which is currently 645,200 characters. Each unit increase in set maxvar increases the length maximums by 129. The maximum value of set maxvar is 32,767. Thus, the maximum line length may be set up to 4,227,159 characters if you set maxvar to its largest value.
r(920);
When I reduce the number of subdirs to 5, Stata works fine. Since I have roughly 100 subdirs, I would have to repeat the process 20 times. That is manageable, but I still want to know whether I can fully automate it: more specifically, fill the macro up to its maximum allowable length, import those files, and then move on to the next group of subdirs.
Below you can find my code:
//====================================
//=== read and clean projects data ===
//====================================
version 14
set linesize 80
set more off
clear
macro drop _all
set linesize 200
cd G:\Data_backup\Soufang_data
*----------------------------------
* Read all files within directory
*----------------------------------
* Import the first worksheets 1:"项目首页" (project home page) 2:"项目概况" (project overview) 3:"成交详情" (transaction details)
* worksheet1
filelist, directory("G:\Data_backup\Soufang_data") pattern(*.xlsx)
* pattern(*.xlsx) prevents importing other file types (.doc or .dta)
gen tag = substr(reverse(dirname),1,6) == "esuoh/"
keep if tag==1
gen path = dirname+"\"+filename
qui valuesof path if tag==1
local filelist = r(values)
split dirname, parse("\" "/")
ren dirname4 citylist
drop dirname1-dirname3 dirname5
qui valuesof citylist if tag==1
local city = r(values)
local count = 1
local n:word count `filelist'
forval i = 1/`n' {
local file : word `i' of `filelist'
local cityname: word `i' of `city'
** don't add xlsx after `file', suffix has been added
** write "`file'" rather than `file', I don't know why but it works
qui import excel using "`file'",clear
cap qui sxpose,clear
cap qui drop in 1/1
gen city = "`cityname'"
if `count'==1 {
save house.dta,replace emptyok
}
else {
qui append using house
qui save house.dta,replace emptyok
}
local ++count
}
Thank you.
You do not need to store the whole list of files in a macro. filelist creates a dataset of the files that you want to work with. Just save it and reload it for each file you want to process. You also use a very inefficient way to append datasets. As the appended dataset grows, the cost of reloading and saving it becomes very high and can slow the whole process down to a crawl.
Here's a sketch of how to process your Excel files
filelist, directory(".") pattern(*.xlsx)
save "myfiles.dta", replace
local n = _N
forval i = 1/`n' {
use in `i' using "myfiles.dta", clear
local f = dirname + "/" + filename
qui import excel using "`f'",clear
tempfile res`i'
save "`res`i''"
}
clear
forval i = 1/`n' {
append using "`res`i''"
}
save "final.dta", replace
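The remark about the cost of re-saving a growing dataset can be made concrete with a rough count. The sketch below is Python, not Stata, and rests on simplifying assumptions: every file has the same number of rows, only rows written to disk are counted, and the 100 files of 1,000 rows are hypothetical figures.

```python
def rows_written_resaving(n_files: int, rows_per_file: int) -> int:
    """Append each file to the combined dataset and save the whole thing again."""
    total_written = 0
    combined_size = 0
    for _ in range(n_files):
        combined_size += rows_per_file
        total_written += combined_size  # the full combined dataset is rewritten
    return total_written

def rows_written_tempfiles(n_files: int, rows_per_file: int) -> int:
    """Save each file once as a temporary file, then save the final dataset once."""
    per_file_saves = n_files * rows_per_file
    final_save = n_files * rows_per_file
    return per_file_saves + final_save

n, r = 100, 1000  # hypothetical: 100 files of 1,000 rows each
print(rows_written_resaving(n, r))   # grows with the square of n
print(rows_written_tempfiles(n, r))  # grows linearly with n
```

Under these assumptions the first strategy writes about 25 times as many rows for 100 files, and the gap widens as the number of files grows.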

Is it possible to use the 'where' statement in elasticnet (SAS)?

Here is the code I am using for variables selection:
proc glmselect data=abct;
where incex1=1;
title 'GLMSELECT with Elastic Net';
model devmood_c = asetot age yrseduc sex employyn cohabyn caucyn asitot penntot
anxdis ahealthuse ahospit ventxpwk acmn nhospit bmi comorb
aqllimmn aqlsubmn aqlsympmn aqlemotmn aqlenvirmn aqltotmn
smoke3gp nalcwkcurr
/selection=elasticnet(steps=120 L2=0.001 choose=validate);
run;
The problem is that, when I run it, it tells me:
ERROR: Variable incex1 is not on file WORK.ABCT.
This incex1 variable is used to exclude people in our database who scored too high on a particular question. It works with LASSO, but even though the code is similar, it doesn't seem to work with elasticnet.
Does anyone know how I could use it or if there is another way to exclude the patients who scored under a certain threshold on a questionnaire?
This is how incex1 has been coded:
if devmood_c = 0 then incex1=1;
if devmood_c = 1 then incex1=1;
if devmood_c = . then incex1=0;
if bdisev > 2 then incex1=0;
label incex1 = "1=no mood at baseline or BDI > 20, 0=excluded";
This works in test data, so it is likely an issue with your source data not having the characteristics you expect. For example,
ods graphics on;
proc glmselect data=sashelp.Leutrain valdata=sashelp.Leutest
plots=coefficients;
where x1>0;
model y = x2-x7129/
selection=elasticnet(steps=120 l2=0.001 choose=validate);
run;
That works as expected.

How to retrieve data from multiple Stata files?

I have 53 Stata .dta files. Each of them is 150-200 MB and contains an identical set of variables, but for different years. It is not practical to combine or merge them because of their size.
I need to retrieve some averaged values (percentages etc.). Therefore, I want to create a new Stata file, New.dta, and write a .do file that would run on that new Stata file in the following way: it should open each of those 53 Stata files, make certain calculations, and store the results in the new Stata file, New.dta.
I am not sure how I can keep two Stata files open simultaneously, or how I can store the calculated values.
When I open a second .dta file, how can I make the first one stay open? How can I store the calculated values in a global variable?
What springs to mind here is the use of postfile.
Here is a simple example. First, I set up an example of several datasets. You already have this.
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
save test`i'
clear
}
Now I set up postfile. I need to set up a handle, what variables will be used, and what file will be used. Although I am using a numeric variable to hold file identifiers, it will perhaps be more typical to use a string variable. Also, looping over filenames may be a bit more challenging than this. fs from SSC is a convenience command that helps put a set of filenames into a local macro; its use is not illustrated here.
postfile mypost what mean using alltest.dta
forval i = 1/10 {
use test`i', clear
su foo, meanonly
post mypost (`i') (`r(mean)')
}
Now flush results
postclose mypost
and see what we have.
u alltest
list
     +-----------------+
     | what       mean |
     |-----------------|
  1. |    1   .5110765 |
  2. |    2   1.016858 |
  3. |    3   1.425967 |
  4. |    4   2.144528 |
  5. |    5   2.438035 |
     |-----------------|
  6. |    6   3.030457 |
  7. |    7   3.356905 |
  8. |    8   4.449655 |
  9. |    9   4.381101 |
 10. |   10   5.017308 |
     +-----------------+
I didn't use any global macros (not global variables) here; you should not need to.
An alternative approach is to loop over files and use collapse to "condense" these files to the relevant means, and then append these condensed files. Here is an adaptation of Nick's example:
// create the example datasets
clear
forval i = 1/10 {
set obs 100
gen foo = `i' * runiform()
gen year = `i'
save test`i', replace
clear
}
// use collapse and append
// to create the dataset you want
use test1, clear
collapse (mean) year foo
save means, replace
forvalues i = 2/10 {
use test`i', clear
collapse (mean) year foo
append using means
save means, replace
}
// admire the result
list
Note that if your datasets are not named sequentially like test1.dta, test2.dta, ..., test53.dta, but rather like results-alaska.dta, result_in_alabama.dta, ..., "wyoming data.dta" (note the space and hence the quotes), you would have to organize the loop over these files somewhat differently:
local allfiles : dir . files "*.dta"
foreach f of local allfiles {
use `"`f'"', clear
* all other code from Maarten's or Nick's approach
}
This is a more advanced use of local macros; see help extended macro functions. Note also that Stata will produce a list that looks like "results-alaska.dta" "result_in_alabama.dta" "wyoming data.dta", with quotes around the file names, so when you invoke use, you will have to enclose the file name in compound quotes.
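Why the quotes matter is easy to see with a small experiment outside Stata. The sketch below is Python and is only an analogy: shlex follows POSIX shell quoting rules, not Stata's macro rules, but it shows what goes wrong when a quoted list is split on whitespace alone.

```python
import shlex

# A list of file names in the quoted form described above.
listing = '"results-alaska.dta" "result_in_alabama.dta" "wyoming data.dta"'

naive = listing.split()              # splits on every space, ignoring quotes
quote_aware = shlex.split(listing)   # treats each quoted string as one item

print(len(naive), naive)
print(len(quote_aware), quote_aware)
```

The naive split breaks "wyoming data.dta" into two pieces, which is the kind of failure the compound quotes around the file name guard against.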