Is it possible to use the 'where' statement in elasticnet (SAS)?

Here is the code I am using for variable selection:
proc glmselect data=abct;
where incex1=1;
title 'GLMSELECT with Elastic Net';
model devmood_c = asetot age yrseduc sex employyn cohabyn caucyn asitot penntot
anxdis ahealthuse ahospit ventxpwk acmn nhospit bmi comorb
aqllimmn aqlsubmn aqlsympmn aqlemotmn aqlenvirmn aqltotmn
smoke3gp nalcwkcurr
/selection=elasticnet(steps=120 L2=0.001 choose=validate);
run;
The problem is that, when I run it, it tells me:
ERROR: Variable incex1 is not on file WORK.ABCT.
This incex1 variable is used to exclude people in our database who scored too high on a particular question. It works with LASSO but, even though the code is similar, it doesn't seem to work with elasticnet.
Does anyone know how I could use it, or whether there is another way to exclude the patients who scored under a certain threshold on a questionnaire?
This is how incex1 has been coded:
if devmood_c = 0 then incex1=1;
if devmood_c = 1 then incex1=1;
if devmood_c = . then incex1=0;
if bdisev > 2 then incex1=0;
label incex1 = "1=no mood at baseline or BDI > 20, 0=excluded";

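This works on test data, so the WHERE statement itself is fine with elastic net. The error says that incex1 is not in WORK.ABCT, so the likely issue is that your source data does not have the characteristics you expect (for instance, the step that created incex1 may have written to a different data set). For example,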
ods graphics on;
proc glmselect data=sashelp.Leutrain valdata=sashelp.Leutest
plots=coefficients;
where x1>0;
model y = x2-x7129/
selection=elasticnet(steps=120 l2=0.001 choose=validate);
run;
That works as expected.

Related

Multinomial Logit Fixed Effects: Stata and R

I am trying to run a multinomial logit with year fixed effects in mlogit in Stata (panel data: year-country), but I do not get standard errors for some of the models. When I run the same model using multinom in R I get both coefficients and standard errors.
I do not use Stata frequently, so I may be missing something or I may be running different models in Stata and R and should not be comparing them in the first place. What may be happening?
I created a data example to show the problem. Here are a few details about the simple version of the model of interest:
Dependent variable (will call it DV1) with 3 categories of -1, 0, 1 (unordered and 0 as reference)
Independent variables: 2 continuous variables, 3 binary variables, interaction of 2 of the 3 binary variables
Years: 1995-2003
Number of observations in the model: 900
In R I run the code and get coefficients and standard errors as below.
R version of code creating data and running the model:
## Fabricate example data
library(fabricatr)
data <- fabricate(
N = 900,
id = rep(1:900, 1),
IV1 = draw_binary(0.5, N = N),
IV2 = draw_binary(0.5, N = N),
IV3 = draw_binary(0.5, N = N),
IV4 = draw_normal_icc(mean = 3, N = N, clusters = id, ICC = 0.99),
IV5 = draw_normal_icc(mean = 6, N = N, clusters = id, ICC = 0.99))
library(AlgDesign)
DV = gen.factorial(c(3), 1, center=TRUE, varNames=c("DV"))
year = gen.factorial(c(9), 1, center=TRUE, varNames=c("year"))
DV = do.call("rbind", replicate(300, DV, simplify = FALSE))
year = do.call("rbind", replicate(100, year, simplify = FALSE))
year[year==-4]= 1995
year[year==-3]= 1996
year[year==-2]= 1997
year[year==-1]= 1998
year[year==0]= 1999
year[year==1]= 2000
year[year==2]= 2001
year[year==3]= 2002
year[year==4]= 2003
data1=cbind(data, DV, year)
data1$DV1 = relevel(factor(data1$DV), ref = "0")
## Save data as csv file (to use in Stata)
library(foreign)
write.csv(data1, "datafile.csv", row.names=FALSE)
## Run multinom
library(nnet)
model1 <- multinom(DV1 ~ IV1 + IV2 + IV3 + IV4 + IV5 + IV1*IV2 + as.factor(year), data = data1)
Results from R
When I run the model using mlogit (without fixed effects) in Stata I get both coefficients and standard errors.
So I tried including year fixed effects in the model using Stata three different ways and none worked:
femlogit: fails with the errors "factor-variable and time-series operators not allowed" and "depvar and indepvars may not contain factor variables or time-series operators".
mlogit: the fe option is rejected ("fe not allowed"). Using i.year instead omits certain variables and/or shows only coefficients, without standard errors (example in the code below).
* Read file
import delimited using "datafile.csv", clear case(preserve)
* Run regression
mlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1##IV2 i.year, base(0) iterate(1000)
Stata results
xtmlogit: does not run. Error message: "total number of permutations is 2,389,461,218; this many permutations require a considerable amount of memory and can result in long run times; use option force to proceed anyway, or consider using option rsample()".
Fixed effects and non-linear models (such as logits) are an awkward combination. In a linear model you can simply add dummies/demean to get rid of a group-specific intercept, but in a non-linear model none of that works. I mean you could do it technically (which I think is what the R code is doing) but conceptually it is very unclear what that actually does.
Econometricians have spent a lot of time working on this, which has led to some work-arounds, usually referred to as conditional logit. IIRC this is what's implemented in femlogit. I think the mistake in your code is that you tried to include the fixed effects through a dummy specification (i.year). Instead, you should xtset your data and then run femlogit without the dummies.
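xtset year
* femlogit rejects factor variables (see the error above); IV1xIV2 is an illustrative name
generate IV1xIV2 = IV1*IV2
femlogit DV1 IV1 IV2 IV3 IV4 IV5 IV1xIV2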
Note that these conditional logit models can be very slow. Personally, I'm more a fan of running two one-vs-all linear regressions: one on an indicator that equals 1 when the outcome is 1 and 0 otherwise, and one on an indicator that equals 1 when the outcome is -1 and 0 otherwise. However, opinions are divided (Wooldridge appears to be a fan too, many others very much not so).

How to format a variable in Stata?

I am trying to grab the formatted current date and create a variable from it using the command:
gen %tdCY-N-D final_dayinpt = date(c(current_date), "DMY")
However, I am getting an error
%tdCY invalid name
r(198);
If I display this at the Stata command line it works:
. display %tdCY-N-D date(c(current_date), "DMY")
2020-10-27
How can I create this formatted variable?
Solution:
set obs 10 // so the example works
generate final_dayinpt = date(c(current_date), "DMY")
format final_dayinpt %tdCY-N-D
The syntax you're trying is tempting because putting the display format first seems analogous to things like generate byte bytevar = 1 but, as you found, the analogy doesn't hold here.
Note that you are passing format information where type is expected based on the generate syntax (help generate):
generate [type] newvar[:lblname] =exp [if] [in] [, before(varname) | after(varname)]
While consulting help generate and help display is helpful, help datetime is also very useful here (and was never obvious to me).
See here for a more thorough treatment of working with dates in Stata.
Edit:
An alternative suggested (and made possible) by Nick Cox:
ssc install numdate // install the package which Nick wrote
generate current_date = c(current_date) // numdate takes a varlist
numdate daily final_dayinpt = current_date, pattern(DMY) format(%tdCY-N-D)

Precisions and counts

I am working with an educational dataset called IPEDS from the National Center for Education Statistics. They track students in college by major, degree completion, etc. The problem is that, in Stata, I am trying to determine the total count of degrees obtained in a specific major.
They have a variable cipcode which contains values that serve as "majors". cipcode might be 14.2501 "Petroleum Engineering", 16.0102 "Linguistics" and so forth.
When I run code like
tab cipcode if cipcode==14.2501
it reports no observations. What code will give me the totals?
/*Convert Float Variable to String Variable and use Force Replace*/
tostring cipcode, gen(cipcode_str) format(%6.4f) force
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "0"), .))
replace cipcode_str = reverse(substr(reverse(cipcode_str), indexnot(reverse(cipcode_str), "."), .))
/* Created a total variable called total_t1 for total count of all stem majors listed in table 1*/
gen total_t1 = cipcode_str== "14.2501" + "14.3901" + "15.0999" + "40.0601"
This minimal example confirms your problem. (See, by the way, https://stackoverflow.com/help/mcve for advice on good examples.)
* code
clear
input code
14.2501
14.2501
14.2501
end
tab code if code == 14.2501
tab code if code == float(14.2501)
* results
. tab code if code == 14.2501
no observations
. tab code if code == float(14.2501)
       code |      Freq.     Percent        Cum.
------------+-----------------------------------
    14.2501 |          3      100.00      100.00
------------+-----------------------------------
      Total |          3      100.00
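The keyword is one you use: precision. In Stata, search precision for resources, starting with blog posts by William Gould. A decimal like 14.2501 is hard (impossible) to hold exactly in binary, and the details of holding a variable as type float can bite: the value stored in the float variable and the literal 14.2501, which Stata evaluates in double precision, are slightly different numbers, so == fails unless you compare with float(14.2501).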
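The effect is not specific to Stata. As a minimal sketch in Python (used here only for illustration; to_float32 is a hypothetical helper that rounds a double to 4-byte precision, mimicking what storing a value in a float variable does), the comparison behaves the same way:

```python
import struct


def to_float32(x: float) -> float:
    """Round a double to the nearest 4-byte float and return it as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]


literal = 14.2501             # evaluated in double precision, like a literal in Stata
stored = to_float32(literal)  # what a float variable would actually hold

naive_match = stored == literal              # analogue of: code == 14.2501
float_match = stored == to_float32(literal)  # analogue of: code == float(14.2501)

print(naive_match, float_match)
print(abs(stored - literal))  # small but nonzero rounding gap
```

The first comparison fails and the second succeeds, because both sides of the second have been rounded to the same 4-byte value.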
It's hard to see what you're doing with your last block of code, which you don't explain. The last statement looks puzzling, as you're adding strings. Consider what happens with
. gen whatever = "14.2501" + "14.3901" + "15.0999" + "40.0601"
. di whatever[1]
14.250114.390115.099940.0601
The result is a long string that cannot be a valid cipcode. I suspect that you are reaching towards
... if inlist(cipcode_str, "14.2501", "14.3901", "15.0999", "40.0601")
which is quite different.
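To make the difference concrete, here is a small Python sketch (for illustration only; the values in cipcodes are made up) that contrasts gluing the strings together with a membership test, which is what inlist() performs:

```python
# Hypothetical codes, already converted to strings as in cipcode_str
cipcodes = ["14.2501", "16.0102", "14.3901", "40.0601", "52.0201", "14.2501"]
stem_codes = ("14.2501", "14.3901", "15.0999", "40.0601")

# Adding the strings produces one long string that no single code can equal
concatenated = "".join(stem_codes)
matches_concatenated = sum(code == concatenated for code in cipcodes)

# Membership test: the analogue of inlist(cipcode_str, ...)
total_stem = sum(code in stem_codes for code in cipcodes)

print(concatenated)
print(matches_concatenated, total_stem)
```

Only the membership test counts the observations whose code is one of the listed majors.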
But using float() is the minimal trick for this problem.

Setting bounds in order to work on MMM modeling using SAS Procedures

I am trying to build an MM model using the SAS procedure proc nlin on the following columns (4 lines of the actual data attached):
ACVDO  ACVFO  ACVFD  ACVPR
 7.84  12.82   1.21  44.34
 4.96  22.54   2.77  40.69
 6.51  12.23   0.63  32.42
 4.97  13.3    2.46  37.06
The code I used is as follows:
proc nlin data=Bags UNCORRECTEDDF;
parameters b0=0 b1=0 b2=0 b3=0.004,b4=0 ;
bounds 0.004<b3<=0.00717 ;*/
bounds b4>0.00253;
model mi_EQ = b0+b1*ACVDO+b2*ACVFO+b3*ACVFD+b4*ACVPR;
output out=nlinout predicted=pred ;
run;
There are other variables that also have to be included in the model, but the problem I am facing with this code is that, in spite of setting a range, the coefficients I get sit at one of the extremes rather than somewhere within the specified range.
Can someone help me fix this issue? Am I missing something in the code?

Stata pie chart error

Hi I'm trying to create a pie chart that has a lot of slices. For some reason I get an error when running this code.
My code
graph pie ccounter if year==1900 & ccounter>100 & labforce==2, over(occ1950)
and I get this error
(note: areastyle p193pie not found in scheme, default attributes used)
(note: areastyle p194pie not found in scheme, default attributes used)
(note: areastyle p195pie not found in scheme, default attributes used)
(note: areastyle p196pie not found in scheme, default attributes used)
option min() incorrectly specified
Note that the variable occ1950 has more than 100 values. I don't know whether this is what is causing the problem.
Extra Information
I use this code to create the variable ccounter
bys mcdstr year occ: gen counter=_n
bys mcdstr year occ: egen ccounter=max(counter)
I used this to calculate the number of people working in each industry by year and location.
The problem is that the variable occ1950 has too many unique values. Let us examine the problem using a CSV dataset of only 40 countries.
country,fdi
Afghanistan,141.391
Algeria,541.478
Angola,238.637
Antigua and Barbuda,1.653
Argentina,205.691
Bahamas,21.927
Bahrain,1.317
Bangladesh,50.298
Barbados,2.816
"Bolivia, Plurinational State of",41.572
Botswana,87.649
Brazil,455.5649999999999
British Virgin Islands,12387.568
Brunei Darussalam,21.02
Cambodia,672.6800000000001
Cameroon,30.159
Cape Verde,3.783
Cayman Islands,15323.116
Chile,53.149
Colombia,49.047
Congo,112.104
"Congo, Democratic Rep. of",302.505
Costa Rica,.826
Côte d' Ivoire,27.099
Dominican Republic,.112
Ecuador,93.673
Egypt,191.59
Equatorial Guinea,80.21700000000001
Eritrea,16.269
Ethiopia,205.824
Fiji,38.742
Gabon,76.66500000000001
Ghana,129.068
Guinea,97.413
Guyana,79.899
Honduras,2.6
"Hong Kong, China",124987.422
India,291.567
Indonesia,850.0709999999999
"Iran, Islamic Republic of",480.71
After loading this into Stata 14, we can observe that
graph pie fdi, over(country)
produces the error: option min() incorrectly specified.
If we now reduce the dataset to 30 countries with drop if _n > 30, we are able to get a pie chart.
This suggests that you should collapse your data, take the n categories with the largest ccounter and then classify the other classes as "other".
The magic number is 36. So you can have at most 36 unique categories in your pie chart.
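The lumping step can be sketched as follows. This is a minimal Python illustration rather than Stata code, taking the limit of 36 reported above as given; lump_small_categories, the "Other" label and the occupation totals are all hypothetical:

```python
def lump_small_categories(totals, max_slices=36, other_label="Other"):
    """Keep the largest categories and merge the rest into one slice.

    totals maps category name -> count. The result has at most
    max_slices entries, one of which is the merged other_label slice.
    Assumes other_label is not already a category name.
    """
    if len(totals) <= max_slices:
        return dict(totals)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    keep = ranked[: max_slices - 1]   # leave one slot for the merged slice
    rest = ranked[max_slices - 1:]
    lumped = dict(keep)
    lumped[other_label] = sum(count for _, count in rest)
    return lumped


# Made-up example: 100 occupations with counts 1..100
occupation_totals = {f"occ{i}": i for i in range(1, 101)}
pie_data = lump_small_categories(occupation_totals)
print(len(pie_data), pie_data["Other"])
```

Reserving one of the slots for the merged slice keeps the total number of categories within the limit while preserving the overall sum.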