I am very new to SAS and want to create a simple dummy variable (MALE) that equals 1 if SEX = 1, and equals 0 if SEX = 2. However, I get error messages: ERROR: The decimal specification of 2 must be less than the width specification of 1.
How do I solve this? This is the code I use:
DATA WORK.BMI_D ;
SET WORK.BMI ;
IF SEX = 1 THEN MALE = 1;
ELSE MALE = 0;
RUN;
The variable SEX has length 8, type Numeric and format F8.2. What am I doing wrong?
You have not shown the code that is generating that error message, but why not just remove the illogical format that you have attached to the variable SEX? Perhaps the error comes from a later step that is trying to display SEX with a width of only 1 byte and is having trouble displaying strings like 1.00 or 2.00 that the F8.2 format would generate.
Since there is no need to use a special display format for numeric values of 1 and 2, just remove the format from SEX and see if that solves the issue.
DATA WORK.BMI_D ;
SET WORK.BMI ;
IF SEX = 1 THEN MALE = 1;
ELSE MALE = 0;
format sex ;
RUN;
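A slightly more compact variant, offered as a sketch rather than the only way to do it: in SAS a comparison evaluates to 1 (true) or 0 (false), so the dummy can be assigned directly. The check for a missing SEX is an added assumption; the IF/ELSE version above sets MALE to 0 for those rows.

```sas
data work.bmi_d;
   set work.bmi;
   format sex;                      /* remove the F8.2 format from SEX */
   if missing(sex) then male = .;   /* assumption: keep unknown sex as missing */
   else male = (sex = 1);           /* comparison yields 1 or 0 */
run;
```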
Referring to the code below: after I transpose a dataset (output qc2), I tried to create a percentage column (most_recent_wk_percent_change), but every row of that column is 12.5% and two new columns, &week3. and &week2., are created. The expected result is a calculation based on the values in the week2 and week3 columns. I know the problem could be how the two columns are referenced in the percentage calculation (==> ( &week3. - &week2.)/&week2.;), but I couldn't work out the correction. Please advise :)
%let week1 = 7;
%let week2 = 8;
%let week3 = 9;
proc sql;
create table qc as
select t_week, prod_cat, sum(sales) as sales
from master_table
where t_week in (&week1.,&week2.,&week3.)
group by 1,2
order by 2;
quit;
proc transpose data= qc out=qc2;
by prod_cat ;
id t_week;
run;
data qc2;
set qc2;
format most_recent_wk_percent_change PERCENT7.1;
most_recent_wk_percent_change = ( &week3. - &week2.)/&week2.;
run;
qc:
t_week|prod_cat|sales
7|cat|100
8|cat|200
9|cat|300
7|dog|150
8|dog|400
9|dog|300
7|rat|200
8|rat|600
9|rat|300
qc2: (TRANSPOSED TABLE --> note the column names 7, 8, 9, which are expected)
prod_cat|7|8|9
cat|100|200|300
dog|150|400|300
rat|200|600|300
qc2: (i wanted to get the change in % )
prod_cat|7|8|9|most_recent_wk_percent_change|&week2.|&week3.
cat|100|200|300|12.5%|.|.| ==> 12.5% is wrong. should be 50% (300-200)/(200)
dog|150|400|300|12.5%|.|.| ==> 12.5% is wrong. should be -25%
rat|200|600|300|12.5%|.|.| ==> 12.5% is wrong. should be -50%
I have no idea what you are doing or why, but if you have set VALIDVARNAME=any and the actual name of your variable is 7 and you try to use it in SAS code like this:
ratio = 7/8 ;
Then SAS will assume you mean the numeric value 7.
You need to use a name literal instead.
ratio = '7'n / '8'n ;
So you want
most_recent_wk_percent_change = ("&week3"n-"&week2"n)/"&week2"n;
If instead the actual name of the variable is _7 then you need to code this way.
most_recent_wk_percent_change = (_&week3.-_&week2.)/_&week2.;
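Another way to avoid numeric-looking column names altogether is the PREFIX= option of PROC TRANSPOSE. The following is a minimal sketch, assuming the qc table from the question; the prefix wk and the explicit VAR statement are illustrative choices, not part of the original code.

```sas
%let week2 = 8;
%let week3 = 9;

/* qc is assumed to be sorted by prod_cat, as in the question's ORDER BY */
proc transpose data=qc out=qc2(drop=_name_) prefix=wk;
   by prod_cat;
   id t_week;      /* columns become wk7, wk8, wk9 */
   var sales;
run;

data qc2;
   set qc2;
   format most_recent_wk_percent_change percent7.1;
   /* the macro references resolve to the ordinary variable names wk9 and wk8 */
   most_recent_wk_percent_change = (wk&week3. - wk&week2.) / wk&week2.;
run;
```

With the sample data, the cat row computes (300 - 200) / 200 = 0.5.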
Try adding a keep statement to your last data step; this will keep only the columns you want in the output.
data qc2 (keep= most_recent_wk_percent_change prod_cat);
set qc2;
format most_recent_wk_percent_change PERCENT7.1;
most_recent_wk_percent_change = ( &week3. - &week2.)/&week2.;
run;
I am trying to develop a recursive program to fill in missing string values using flat probabilities (for instance, if a variable had three possible values and one observation was missing, the missing observation would have a 33% chance of being replaced with any given value).
Note: The purpose of this post is not to discuss the merit of imputation techniques.
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
The SCAN function does not create an F or M observation within gender. It also appears to create new M and F variables. Additionally, the DO loop creates an additional entry under CustomerKey. Is there any way to get rid of these?
I would prefer to use loops and macros to solve this. I'm not yet proficient with arrays.
Here is my attempt at tidying this up a little:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
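One caveat about the index calculation: ROUND(u*(k - 1)) + 1 is uniform only when there are two categories. With three or more, the first and last categories each get half the probability of the middle ones. The sketch below uses CEIL instead, assuming the macro variables rescats and genders created by the PROC SQL step above; the variable name pick is hypothetical.

```sas
data want;
   set have;
   if missing(gender) then do;
      u = ranuni(12345);               /* uniform value strictly between 0 and 1 */
      pick = ceil(u * &rescats);       /* integer from 1 to &rescats, each equally likely */
      gender = scan("&genders", pick, ',');
   end;
   drop u pick;                        /* keep only the original columns */
run;
```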
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
Tip: You can use a dot (.) to mean a missing value for a character variable during INPUT.
Tip: DATALINES is the modern alternative to CARDS.
Tip: Data values don't have to line up, but it helps humans.
Thus this works as well:
/*Space-delimited list input; a dot marks a missing character value*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
Tip: Your technique requires two passes over the data.
One to determine the distinct values.
A second to apply your imputation.
Most approaches require two passes per variable processed. A hash approach can get by with only two passes overall, but requires more memory.
There are many ways to determine distinct values: SORTING+FIRST., Proc FREQ, DATA Step HASH, SQL, and more.
Tip: Solutions that move data to code back to data are sometimes needed, but can be troublesome. Often the cleanest way is to let data remain data.
For example: INTO will be the wrong approach if the concatenated distinct values would require more than 64K.
Tip: Data to Code is especially troublesome for continuous values and other values that are not represented exactly the same when they become code.
For example: high precision numeric values, strings with control-characters, strings with embedded quotes, etc...
This is one approach using SQL. As mentioned before, Proc SURVEYSELECT is far better for real applications.
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
You don't have to be concerned about columns that have no missing values, because no replacement would occur in those. For columns that do have missing values, the list of candidate REPLACEMENTS excludes the missing value, and REPLACEMENT_COUNT is correct for computing the uniform probability of replacement, 1/COUNT, coded as rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234)).
I am supposed to create a summary data set containing the mean, median, and standard deviation broken down by gender and group (using the CLASS statement). Using this summary data set, create four other data sets (in one DATA step) as follows:
(1) grand mean
(2) stats broken down by gender
(3) stats broken down by group
(4) stats broken down by gender and group
The hint given was to use the CHARTYPE option.
I provided my attempted solution, but I don't think I did it in the way asked.
DATA CLINICAL;
*Use LENGTH statement to control the order of
variables in the data set;
LENGTH PATIENT VISIT DATE_VISIT 8;
RETAIN DATE_VISIT WEIGHT;
DO PATIENT = 1 TO 25;
IF RANUNI(135) LT .5 THEN GENDER = 'Female';
ELSE GENDER = 'Male';
X = RANUNI(135);
IF X LT .33 THEN GROUP = 'A';
ELSE IF X LT .66 THEN GROUP = 'B';
ELSE GROUP = 'C';
DO VISIT = 1 TO INT(RANUNI(135)*5);
IF VISIT = 1 THEN DO;
DATE_VISIT = INT(RANUNI(135)*100) + 15800;
WEIGHT = INT(RANNOR(135)*10 + 150);
END;
ELSE DO;
DATE_VISIT = DATE_VISIT + VISIT*(10 + INT(RANUNI(135)*50));
WEIGHT = WEIGHT + INT(RANNOR(135)*10);
END;
OUTPUT;
IF RANUNI(135) LT .2 THEN LEAVE;
END;
END;
DROP X;
FORMAT DATE_VISIT DATE9.;
RUN;
PROC MEANS DATA=CLINICAL;
CLASS GENDER GROUP;
OUTPUT OUT=SUMMARY
MEAN=
MEDIAN=
STDDEV= / AUTONAME;
RUN;
No, what they're asking you to do is:
Use the OUTPUT statement in PROC MEANS to create a summary dataset. Choose the appropriate TYPES and CLASS values in PROC MEANS such that all four sets of data are represented on the output.
Using a single data step that has four dataset names on the data statement, selectively output those rows to the correct dataset. You would use the _TYPE_ variable to determine which dataset a row would be output to.
CHARTYPE just means your _TYPE_ variable will look like 1001 instead of 9 (the binary representation, basically). 1001 indicates which class variables are used (the first and the fourth) to create that breakout. (With only two class variables, the possible values are 00, 01, 10, and 11.) This is sometimes easier for non-programmers who aren't used to thinking in binary: without CHARTYPE these values would be 0, 1, 2, and 3 in decimal, which can make it harder to tell which value corresponds to which variable.
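The two steps described above might look like the sketch below. It assumes the CLINICAL data set from the question; the choice of WEIGHT as the analysis variable and the four output data set names are illustrative assumptions, not part of the assignment text.

```sas
proc means data=clinical noprint chartype;
   class gender group;
   var weight;                          /* assumption: summarize WEIGHT only */
   output out=summary mean= median= stddev= / autoname;
run;

/* One DATA step, four output data sets, routed by the character _TYPE_ */
data grand by_gender by_group by_gender_group;
   set summary;
   select (_type_);
      when ('00') output grand;            /* no class variables: overall stats */
      when ('10') output by_gender;        /* first class variable only */
      when ('01') output by_group;         /* second class variable only */
      when ('11') output by_gender_group;  /* both class variables */
      otherwise;
   end;
run;
```

Each position in _TYPE_ lines up with a variable on the CLASS statement, so '10' means GENDER was used and GROUP was not.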
This is a follow-up to my previous post on SO.
I am trying to produce a frequency table of demographics, including race, sex, and ethnicity. One table is a crosstab of race by sex for Hispanic participants in a study. However, there are no Hispanic participants thus far. So, the table will be all zeroes, but we still have to report it.
This can be done in R, but so far, I have found no solution for SAS. Example data is below.
data race;
input race eth sex ;
cards;
1 2 1
1 2 1
1 2 2
2 2 1
2 2 2
2 2 1
3 2 2
3 2 2
3 2 1
4 2 2
4 2 1
4 2 2
run;
data class;
do race = 1,2,3,4,5,6,7;
do eth = 1,2,3;
do sex = 1,2;
output;
end;
end;
end;
run;
proc format;
value frace 1 = "American Indian / AK Native"
2 = "Asian"
3 = "Black or African American"
4 = "Native Hawaiian or Other PI"
5 = "White"
6 = "More than one race"
7 = "Unknown or not reported" ;
value feth 1 = "Hispanic or Latino"
2 = "Not Hispanic or Latino"
3 = "Unknown or Not reported" ;
value fsex 1 = "Male"
2 = "Female" ;
run;
***** ethnicity by sex ;
proc tabulate data = race missing classdata=class ;
class race eth sex ;
table eth, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
***** race by sex ;
proc tabulate data = race missing classdata=class ;
class race eth sex ;
table race, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
***** race by sex, for Hispanic only ;
***** log indicates that a logical page with only missing values has been deleted ;
***** Thanks SAS, you're a big help... ;
proc tabulate data = race missing classdata=class ;
where eth = 1 ;
class race eth sex ;
table race, sex / misstext = '0' printmiss;
format race frace. eth feth. sex fsex. ;
run;
I understand that the code really can't work because I'm selecting where eth is equal to 1 (there are no cases satisfying the condition...). Specifying the command to be run by eth doesn't work either.
Any guidance is greatly appreciated...
I think the easiest way is to create a row in the data that has the missing value. You could look at the following paper for suggestions as to how to do this on a larger scale:
http://www.nesug.org/Proceedings/nesug11/pf/pf02.pdf
PROC FREQ has the SPARSE option, which gives you all possible combinations of all variables in the table (including missing ones), but it doesn't look like that gives you exactly what you need.
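If a plain table of counts is acceptable, one way to get the zero rows is to compute the counts outside PROC TABULATE, by left-joining the full grid of categories to the filtered data. This is a minimal sketch, assuming the race and class data sets and the frace. and fsex. formats from the question; the output name hisp_counts is hypothetical.

```sas
proc sql;
   create table hisp_counts as
   select c.race format=frace.,
          c.sex  format=fsex.,
          count(r.race) as n    /* counts matched rows only, so 0 when there are none */
   from (select distinct race, sex from class) as c
        left join
        (select race, sex from race where eth = 1) as r
        on c.race = r.race and c.sex = r.sex
   group by c.race, c.sex
   order by c.race, c.sex;
quit;

proc print data=hisp_counts noobs;
run;
```

With the sample data there are no rows with eth = 1, so all 14 race-by-sex combinations are listed with n = 0.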
Looks like our good friends at Westat have worked with this issue. A description of their solution is shown here.
The code is shown below for convenience, but please cite the original when referencing it.
PROC FORMAT;
value ethnicf
1 = 'Hispanic or Latino'
2 = 'Not Hispanic or Latino'
3 = 'Unknown (Individuals Not Reporting Ethnicity)';
value racef
1 = 'American Indian or Alaska Native'
2 = 'Asian'
3 = 'Native Hawaiian or Other Pacific Islander'
4 = 'Black or African American'
5 = 'White'
6 = 'More Than One Race'
7 = 'Unknown or Not Reported';
value gndrf
1 = 'Male'
2 = 'Female'
3 = 'Unknown or Not Reported';
RUN;
DATA shelldata;
format ethlbl ethnicf. racelbl racef. gender gndrf.;
do ethcat = 1 to 2;
do ethlbl = 1 to 3;
do racelbl = 1 to 7;
do gender = 1 to 3;
output;
end;
end;
end;
end;
RUN;
DATA test;
input pt $ 1-3 ethlbl gender racelbl ;
cards;
x1 2 1 5
x2 2 1 5
x3 2 1 5
x4 2 1 5
x5 2 1 5
x6 2 2 2
x7 2 2 2
x8 2 2 5
x9 2 2 4
x10 2 2 4
RUN;
DATA enroll;
set test;
if ethlbl = 1 then ethcat = 1;
else ethcat = 2;
format ethlbl ethnicf. racelbl racef. gender gndrf.;
label ethlbl = 'Ethnic Category'
racelbl = 'Racial Categories'
gender = 'Sex/Gender';
RUN;
%MACRO TAB_WHERE;
/* PROC SQL step creates a macro variable whose */
/* value will be the number of observations */
/* meeting WHERE clause criteria. */
PROC SQL noprint;
select count(*)
into :numobs
from enroll
where ethcat=1;
QUIT;
/* PROC FORMAT step to display all numeric values as zero. */
PROC FORMAT;
value allzero low-high=' 0';
RUN;
/* Conditionally execute steps when no observations met criteria. */
%if &numobs=0 %then
%do;
%let fmt = allzero.; /* Print all cell values as zeroes */
%let str = ; /*No Cases in Subset - WHERE cannot be used */
%end;
%else
%do;
%let fmt = 8.0;
%let str = where ethcat = 1;
%end;
PROC TABULATE data=enroll classdata=shelldata missing format=&fmt;
&str;
format racelbl racef. gender gndrf.;
class racelbl gender;
classlev racelbl gender;
keyword n pctn all;
tables (racelbl all='Racial Categories: Total of Hispanic or Latinos'),
gender='Sex/Gender'*N=' ' all='Total'*n='' / printmiss misstext='0'
box=[LABEL=' '];
title1 font=arial color=darkblue h=1.5 'Inclusion Enrollment Report';
title2 ' ';
title3 font=arial color=darkblue h=1' PART B. HISPANIC ENROLLMENT REPORT:
Number of Hispanic or Latinos Enrolled to Date (Cumulative)';
RUN;
%MEND TAB_WHERE;
%TAB_WHERE
I found this paper to be very informative:
Oh No, a Zero Row: 5 Ways to Summarize Absolutely Nothing
The preloadfmt option in proc means (Method 5) is my favorite. Once you create the necessary formats it's not necessary to add dummy data. It's odd that they haven't yet added this option to proc freq.
I have a column in my SAS file named age and another column named agefinal. I want to replace the values in the age column with the values in the agefinal column for just one ID (that is, 5).
The code that I used was:
Data temp;
set temp;
if ID = 5;
then age = agefinal;
run;
I could not substitute the values. The values in the age column did not change. I ran the code below to check the type and length of the values, since the type appears to be numeric for both columns.
Code:
Proc contents data = temp;
tables age agefinal;
run;
The output that I got was:
age : character length 3.
agefinal: character length $3
I would appreciate your suggestions.
Try removing the semicolon at the end of the if statement. Right now what you're doing is deleting all records where the id isn't equal to five.
Try setting the formats to be the same
data temp;
modify temp;
format age agefinal $3.;
run;
and then see if it will let you do the substitution.
The code you provided runs with an ERROR. Remove the additional semicolon, which may fix your issue:
/* ORIGINAL */
Data temp;
set temp;
if ID = 5;
then age = agefinal;
run;
/* CORRECTED */
Data temp;
set temp;
if ID = 5 /* REMOVED SEMICOLON */
then age = agefinal;
run;
Cheers
Rob