I was working on a problem that involved creating dummy variables, but I ran into an issue: the dummy variables are missing for the observations in the reference category, even though the dataset has no missing values. Even if I select one of the categories to be the reference category, shouldn't the dummy variable values be zero? I had the same issue even when I did not account for missing values. I've included my code, log, output, and the contents of the text file for context and so that my question will be clearer.
The part of the homework assignment that I'm having issues with is the following:
Fibromyalgia is a syndrome of widespread body pain that is often treated by rheumatologists. One way of measuring the impact of fibromyalgia on patients is the Fibromyalgia Impact Questionnaire (FIQ). On the FIQ, high values show greater impact of disease (bad) and low values show lesser impact of disease (good). We have data on women with fibromyalgia who attended one of two types of disease self-management classes or who received standard care (the control group).
Data from this study are in the file fibr03_sum18.txt on the BS 805 web site in the Assignments section for Class 6. The variables in the data file are:
FIQ score (3.1 format) taken after the classes
Group (1 = class 1, 2 = class 2, 3 = standard care)
Disease Severity (On a scale of 1 to 6) before the classes
Age (years)
Since the data were entered into this file, information on a new patient and a correction to the data have been found. The new patient is in the control group, has FIQ = 8.2, Disease Severity = 2, and Age = 25 years. The correction is that the second subject in class 1 was 17 rather than 18 years old.
A) Create a temporary SAS data set using these data. In the data set, create a set of indicator variables that code for group membership. Use PROC PRINT to list the data.
I read in the text file using column input, but I think it can be read in using list input as well? The text file, fibr03_sum18.txt, contained the data below.
3.1 1 6 21
1.8 1 6 18
3.3 1 5 22
2.9 1 4 15
4.3 1 3 24
4.8 1 3 22
4.9 1 2 17
6.4 1 2 18
5.7 2 5 17
6.1 2 5 25
8.5 2 3 31
7.1 2 2 17
7.7 2 1 25
9.8 2 1 22
5.1 3 4 23
7.2 3 1 15
8.3 3 1 22
6.7 3 2 20
My code for reading in the data and creating the temporary dataset with the dummy variables was:
*Part A: Reading in Data and Creating a Temporary Dataset;
libname HW6 'C:\Users\jackz\Desktop\SAS';
filename HW6new 'C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt';
proc format;
value grpf 1='class 1' 2='class 2' 3='standard care';
run;
data one;
infile HW6new;
input @1 FIQ 3.1 @5 grp 1. @7 disev 1. @9 age 2.;
*Creating Dummy Variables;
if grp=1 then classc1=1; else if grp=2 then classc1=0;
if grp=2 then classc2=1; else if grp=1 then classc2=0;
if grp=. then classc1=.;
if grp=. then classc2=.;
label FIQ='FIQ Score'
grp='Group'
disev='Disease Severity'
age='Age';
format grp grpf.;
run;
*Printout of Dataset one;
proc print data=one label;
run;
My log for this code was:
NOTE: Copyright (c) 2016 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M5)
Licensed to BOSTON UNIVERSITY - SFA T&R, Site 70009029.
NOTE: This session is executing on the W32_10HOME platform.
NOTE: Updated analytical products:
SAS/STAT 14.3
SAS/ETS 14.3
SAS/OR 14.3
SAS/IML 14.3
SAS/QC 14.3
NOTE: Additional host information:
W32_10HOME WIN 10.0.16299 Workstation
NOTE: SAS initialization used:
real time 0.96 seconds
cpu time 0.95 seconds
1 *Part A: Reading in Data and Creating a Temporary Dataset;
2 libname HW6 'C:\Users\jackz\Desktop\SAS';
NOTE: Libref HW6 was successfully assigned as follows:
Engine: V9
Physical Name: C:\Users\jackz\Desktop\SAS
3 filename HW6new 'C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt';
4 proc format;
5 value grpf 1='class 1' 2='class 2' 3='standard care';
NOTE: Format GRPF has been output.
6 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
7 data one;
8 infile HW6new;
9 input @1 FIQ 3.1 @5 grp 1. @7 disev 1. @9 age 2.;
10 *Creating Dummy Variables;
11 if grp=1 then classc1=1; else if grp=2 then classc1=0;
12 if grp=2 then classc2=1; else if grp=1 then classc2=0;
13 if grp=. then classc1=.;
14 if grp=. then classc2=.;
15 label FIQ='FIQ Score'
16 grp='Group'
17 disev='Disease Severity'
18 age='Age';
19 format grp grpf.;
20 run;
NOTE: The infile HW6NEW is:
Filename=C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt,
RECFM=V,LRECL=32767,File Size (bytes)=214,
Last Modified=15Jun2018:12:56:26,
Create Time=15Jun2018:12:56:26
NOTE: 18 records were read from the infile HW6NEW.
The minimum record length was 10.
The maximum record length was 10.
NOTE: The data set WORK.ONE has 18 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
21 *Printout of Dataset one;
22 proc print data=one label;
NOTE: Writing HTML Body file: sashtml.htm
23 run;
NOTE: There were 18 observations read from the data set WORK.ONE.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.27 seconds
cpu time 0.06 seconds
Here is the output, although it is not lined up:
The SAS System
Obs FIQ Score Group Disease
Severity Age classc1 classc2
1 3.1 class 1 6 21 1 0
2 1.8 class 1 6 18 1 0
3 3.3 class 1 5 22 1 0
4 2.9 class 1 4 15 1 0
5 4.3 class 1 3 24 1 0
6 4.8 class 1 3 22 1 0
7 4.9 class 1 2 17 1 0
8 6.4 class 1 2 18 1 0
9 5.7 class 2 5 17 0 1
10 6.1 class 2 5 25 0 1
11 8.5 class 2 3 31 0 1
12 7.1 class 2 2 17 0 1
13 7.7 class 2 1 25 0 1
14 9.8 class 2 1 22 0 1
15 5.1 standard care 4 23 . .
16 7.2 standard care 1 15 . .
17 8.3 standard care 1 22 . .
18 6.7 standard care 2 20 . .
You can see that there are missing values for the dummy variables classc1 and classc2 even though there are no missing values in the original dataset. Should those values read 0, since group 3 does not fall in either grp=1 or grp=2?
Can anyone give me any hints as to what I have done wrong, if I have done anything wrong? Thanks for all of your help!
The output shows that the rows where the flag variables are missing have grp = 3 (standard care). Those values are not set to missing by the IF statements; they are missing because the DATA step resets variables created by assignment to missing at the start of each iteration of its implicit loop.
When grp=3, no IF statement assigns a value to either flag variable, so both keep that initial missing value. The PUT statements below write the flags to the log before and after the IF statements:
* when grp=3 neither classc1 nor classc2 is changed from its initial missing value;
put 'NOTE: ' _n_= classc1= classc2=;
if grp=1 then classc1=1; else if grp=2 then classc1=0;
if grp=2 then classc2=1; else if grp=1 then classc2=0;
if grp=. then classc1=.;
if grp=. then classc2=.;
put 'NOTE: ' _n_= classc1= classc2=;
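A minimal sketch of one way to get zeros for the reference group, assuming the dataset one from the question already exists. In SAS a comparison such as (grp = 1) evaluates to 1 when true and 0 when false, so every non-missing group receives a value. The output dataset name one_fixed is only an illustrative choice.

```sas
data one_fixed;                 /* hypothetical name for the corrected dataset */
    set one;
    if missing(grp) then call missing(classc1, classc2);
    else do;
        classc1 = (grp = 1);    /* 1 for class 1, 0 for class 2 and standard care */
        classc2 = (grp = 2);    /* 1 for class 2, 0 for class 1 and standard care */
    end;
run;

proc print data=one_fixed label;
run;
```

With this coding, the standard care rows (grp = 3) should show 0 for both classc1 and classc2, which makes standard care the reference category.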
I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg;
by ID;
model EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.
Try this macro. This will only output if there are non-missing values of lags that you specify.
%macro lag(data=, out=, window=);
data _want_;
set &data.;
by ID;
LAG_EARN = lag&window.(earn);
if(first.ID) then call missing(lag_earn);
if(NOT missing(lag_earn));
run;
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);
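If the estimated coefficient itself is wanted for each window, one possible approach is sketched below. It is an illustration under stated assumptions, not a tested solution: the macro name rolling_reg and the work tables _windows_ and _est_ are hypothetical, years are assumed to be consecutive within each ID, and a window is kept only if it contains exactly &window. observations with a non-missing LAG_EARN.

```sas
%macro rolling_reg(data=, out=, window=);   /* hypothetical macro name */

    /* Pair each (ID, end year) with the rows that fall inside its window,
       then keep only windows that are completely filled */
    proc sql;
        create table _windows_ as
        select a.id, a.year as end_year, b.year, b.earn, b.lag_earn
        from &data. as a, &data. as b
        where a.id = b.id
          and b.year between a.year - &window. + 1 and a.year
          and not missing(b.lag_earn)
        group by a.id, a.year
        having count(*) = &window.;
    quit;

    proc sort data=_windows_;
        by id end_year year;
    run;

    /* One regression per ID and window; OUTEST= stores the coefficients */
    proc reg data=_windows_ outest=_est_ noprint;
        by id end_year;
        model earn = lag_earn;
    run;
    quit;

    /* IDs as rows, window end years as columns (y1962, y1974, ...) */
    proc transpose data=_est_ out=&out.(drop=_name_) prefix=y;
        by id;
        id end_year;
        var lag_earn;
    run;

%mend rolling_reg;

%rolling_reg(data=have, out=want, window=2);
```

In the output, years with no estimable window appear as missing values (.) rather than "n". With a window of size 2 the line is fitted through just two points, so a larger window is probably needed for meaningful estimates.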
I am so frustrated. I can't even get a proc print to work. I've tried so many things. I don't see the table in results viewer. My log says the file has been read and that I should see results. I've tried turning ods off and on and saving to work folder or saving to my own folder. I've tried switching to a list output. Right now, I just want this code to run which I got from: https://support.sas.com/resources/papers/proceedings11/270-2011.pdf .
data energy;
length state $2;
input region division state $ type expenditures @@;
datalines;
1 1 ME 1 708 1 1 ME 2 379 1 1 NH 1 597 1 1 NH 2 301
1 1 VT 1 353 1 1 VT 2 188 1 1 MA 1 3264 1 1 MA 2 2498
1 1 RI 1 531 1 1 RI 2 358 1 1 CT 1 2024 1 1 CT 2 1405
1 2 NY 1 8786 1 2 NY 2 7825 1 2 NJ 1 4115 1 2 NJ 2 3558
1 2 PA 1 6478 1 2 PA 2 3695 4 3 MT 1 322 4 3 MT 2 232
4 3 ID 1 392 4 3 ID 2 298 4 3 WY 1 194 4 3 WY 2 184
4 3 CO 1 1215 4 3 CO 2 1173 4 3 NM 1 545 4 3 NM 2 578
4 3 AZ 1 1694 4 3 AZ 2 1448 4 3 UT 1 621 4 3 UT 2 438
4 3 NV 1 493 4 3 NV 2 378 4 4 WA 1 1680 4 4 WA 2 1122
4 4 OR 1 1014 4 4 OR 2 756 4 4 CA 1 10643 4 4 CA 2 10114
4 4 AK 1 349 4 4 AK 2 329 4 4 HI 1 273 4 4 HI 2 298
;
proc sort data=energy out=energy_report;
by region division type;
run;
proc format;
value regfmt 1='Northeast'
2='South'
3='Midwest'
4='West';
value divfmt 1='New England'
2='Middle Atlantic'
3='Mountain'
4='Pacific';
value usetype 1='Residential Customers'
2='Business Customers';
run;
ods html file='my_report.html';
proc print data=energy_report;
run;
ods html close;
My log shows no errors:
NOTE: Writing HTML Body file: my_report.html
1582 proc print data=energy_report;
1583 run;
NOTE: There were 44 observations read from the data set WORK.ENERGY_REPORT.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
When I go into my temporary files, I can open the "energy" and "energy_report" data set and I can view all the data. Why can't I see a print output? I'm not sure what I'm missing. I checked the output window, the results viewer window, and all the generated html files. They're all blank.
Thank you
It depends a lot on your set up, but I would enable HTML & Listing output and then check the output.
ods listing;
ods html;
proc print data=sashelp.class;
run;
If you're using EG the results should be in the process flow. If Studio, in the Results tab, if SAS Base, click on Results and open if necessary.
There is an option called 'Show Results as Generated' and it's possible it's been set to off in your installation for some reason. I often set mine up this way because I often generate a lot of files at once (HTML/XLSX) and don't want them to open up automatically.
Where you print to my_report.html, the file will probably be trying to go to C:\my_report.html - put in a full file path instead, and check that when you're done.
change
ods html file='my_report.html';
proc print data=energy_report;
run;
ods html close;
to
ods html file="&path./my_report4.html";
proc print data=energy_report;
run;
ods html close;
where &path contains the path where the file will be created.
Important: use double quotes (") instead of single quotes ('). Macro variables such as &path are resolved only inside double quotes.
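As a sketch, the folder can be defined once in a macro variable and passed through the PATH= option of ODS HTML. The folder C:\temp is only a placeholder assumption; replace it with a folder that exists and that you can write to.

```sas
%let path = C:\temp;    /* hypothetical folder, change to a real one */

/* PATH= sets the output folder; (url=none) keeps folder names out of the links */
ods html path="&path." (url=none) file="my_report.html";

proc print data=energy_report;
run;

ods html close;
```

After this runs, the log note "Writing HTML Body file" names the file, which should then be found in the folder stored in &path.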
I have a panel data set with an id, date, and multiple variables. I'm trying to get the skewness and std dev of "var1" listed by id for a certain date range. I know those items are in the summary detail for "var1", but can't seem to find a way to get it listed by id for my specified date range.
Any help would be greatly appreciated!
Here is an example that may start you on your path.
. webuse pig
(Longitudinal analysis of pig weights)
. xtset id week
panel variable: id (strongly balanced)
time variable: week, 1 to 9
delta: 1 unit
. bysort id: egen sk = skew(weight) if inrange(week,3,8)
(144 missing values generated)
. list if id==1, clean
id week weight sk
1. 1 1 24 .
2. 1 2 32 .
3. 1 3 39 .0709604
4. 1 4 42.5 .0709604
5. 1 5 48 .0709604
6. 1 6 54.5 .0709604
7. 1 7 61 .0709604
8. 1 8 65 .0709604
9. 1 9 72 .
I am new to SAS, so this might be a silly type of question.
Assume there are several datasets with similar structure but different column names. I want to get new datasets with the same number of rows but only a subset of columns.
In the following example, A_auto and B_auto are the original datasets, and SubA and SubB are what I want. What is an efficient way of deriving SubA and SubB?
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord        4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer          4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox           6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica      5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla     3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco        6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
DATA SubA;
set A_auto;
keep A_make A_price;
RUN;
DATA SubB;
set B_auto;
keep B_make B_price;
RUN;
Here's my new answer. This introduces quite a few concepts, but all are necessary to complete this task.
First of all I would store the required part variable names (the suffixes that are common to all datasets) in a new dataset. This keeps them all in one place and makes it easier to change if required.
The next step is to create a regular expression (regex) search string that combines all the names, separated by a pipe (|), which is the regex symbol for "or". I've also added a $ symbol to the end of each name, which ensures that only variables ending with the part names will be selected.
select into :[macroname] is the method to create macro variables within proc sql
Then I set up a macro to extract the specific variable names for the current dataset and use those names to create a view (like my original answer)
The dictionary library referenced in the proc sql is a metadata library that contains information on all active libraries, tables, columns etc, so is a good source of identifying what the actual variable names are called (based on the regex search string created earlier).
You won't need the proc print in your code, I just put it in to show everything is working as expected.
Let me know if this works for you
/* create initial datasets */
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord        4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer          4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox           6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica      5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla     3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco        6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
/* create dataset containing partial name of variables to keep */
data keepvars;
input part_name :$20.;
datalines;
_make
_price
;
run;
/* create regular expression search string from partial names */
proc sql noprint;
select
cats(part_name,'$') /* '$' matches end of string */
into
:name_str separated by '|' /* '|' is an 'or' search operator in regular expressions */
from
keepvars;
quit;
%put &name_str.; /* print search string to log */
/* macro to create views from datasets */
%macro create_views (dsname, vwname); /* inputs are dataset name being read in and view name being created */
/* extract specific variable names to be kept, based on search string */
proc sql noprint;
select
name
into
:vars separated by ' '
from
dictionary.columns
where
libname = 'WORK'
and memname = upper("&dsname.")
and prxmatch("/&name_str./",strip(name))>0; /* prxmatch is regular expression search function */
quit;
%put &vars.; /* print variables to keep to log */
/* create views */
data &vwname. / view=&vwname.;
set &dsname. (keep=&vars.);
run;
/* test view by printing */
proc print data=&vwname.;
run;
%mend create_views;
/* run macro for each dataset */
%create_views(A_auto, SubA);
%create_views(B_auto, SubB);
I have a big panel time series data set. I wish to run this basic SAS regression code:
proc sort data=dataset;
by time_id;
run;
ods output parameterestimates=pe;
proc reg data=dataset;
by time_id;
model y=x1 x2 x3....x15;
quit;
run;
I get this error when I run the code:
ERROR: No valid observations are found.
NOTE: The above message was for the following BY group:
time_id=1
ERROR: No valid observations are found.
NOTE: The above message was for the following BY group:
time_id=2....
Why? My time_id variable exists... is it because I have too many time_id variables? If I select firm_id it works but I want time_id.
Here's a sample of my data (panel time series):
y x firm_id time_id
3.4 100 1 1
2.3 200 1 2
6.5 653 1 3
3 50 2 1
4.34 23 2 2
4.8 55 2 3
1.311 400 3 1
1.23 200 3 2
5.63 50 3 3
You'll get this error message if all values of a particular x variable are missing for a given time_id. Take a look at the example below, where all values of x2 are missing for time_id 1. When you run the code, the Results Output window details the problem: the number of observations with missing values is the same as the number of observations read.
It presumably works for firm_id because each firm_id group still contains some observations with no missing values, so not all values of a particular x variable are missing within any firm_id.
data have;
input y x1 x2 firm_id time_id;
cards;
3.4 100 . 1 1
2.3 200 200 1 2
6.5 653 653 1 3
3 50 . 2 1
4.34 23 23 2 2
4.8 55 55 2 3
1.311 400 . 3 1
1.23 200 200 3 2
5.63 50 50 3 3
;
run;
proc sort data=have;
by time_id;
run;
ods output parameterestimates=pe;
proc reg data=have;
by time_id;
model y=x1-x2;
run;
quit;
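To find which BY groups are affected in a larger dataset, one option is to count missing values per time_id before running the regression. This is a minimal sketch using the example dataset have above; for the real data, the VAR list would name y and all fifteen x variables.

```sas
/* N = non-missing count, NMISS = missing count, per time_id and variable */
proc means data=have n nmiss;
    class time_id;
    var y x1 x2;
run;
```

Any time_id where N is 0 for some variable has no usable observations for PROC REG, because an observation is dropped if any model variable is missing.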