I am using the SAS-University edition.
I need to create a new ethnicity variable based on existing sub ethnicity.
Is there a way to select 'WHITE all' instead of specifying 'WHITE-RUSSIAN' or 'WHITE-EUROPEAN' and create the new variable.
Here is my code.
data agg1;
set agg;
if ethnicity = 'WHITE' then ethnicity1= 'white' ;
if ethnicity = 'WHITE-RUSSIAN' then ethnicity1= 'white' ;
if ethnicity = 'WHITE-EUROPEAN' then ethnicity1= 'white';
.
.
run ;
Use a : modifier:
if ethnicity eq: 'WHITE' then ethnicity1= 'white' ;
Assuming the values of the ethnicity variable are always in the format '<main>-<sub>', you can reduce this to one line and zero if statements:
data agg1;
set add;
ethnicity1=scan(ethnicity,1,'-');
run;
Related
I want to display my data as either yes or no in the output for initaltesting, site visit, and follow up, how would I do that? There are numeric values for this on the data set but want character responses of "y" or "n"
PROC FORMAT;
VALUE SiteVisitfmt 1 = 'yes'
0 = 'no';
VALUE InitialTestingfmt 1 = 'yes'
2 = 'no';
VALUE TestEventfmt 1 = 'One Event '
2 = 'Two Events'
3 = 'Three Events'
4 = 'Four Events'
5 = 'Five Events';
VALUE FollowUpfmt 1 = 'yes'
0 = 'no';
FORMAT SiteVisit SiteVisitfmt. InitialTesting InitialTestingfmt. TestEvent TestEventfmt.
FollowUp FollowUpfmt.;
RUN;
data PMdataedits;
set PMdata (rename = (Number_of_Days_from_Onset_to_Sit =SiteVisit
Number_of_Days_between_Onset_and = InitialTesting
Number_of_Test_Events_in_IRIS = TestEvent
Number_of_Days_between_Test_1_an = FollowUp));
drop SPA;
attrib date1 format=date9.;
date1=input(date,mmddyy10.);
NewSiteVisit = put(SiteVisit, 8.);
NewInitialTesting = put(InitialTesting, 8.);
NewFollowUp = put(FollowUp, 8.);
NewSiteVisit=;
if (NewSiteVisit=<1) THEN NewSiteVisit= '1';
if (NewSiteVisit>1) THEN NewSiteVist= '0';
NewInitialTesting=;
if (NewInitialTesting<=2) THEN NewInitialTesting= '1';
if (NewInitialTesting>2) THEN NewInitialTesting='0';
This statement:
FORMAT SiteVisit SiteVisitfmt. InitialTesting InitialTestingfmt. TestEvent TestEventfmt.
FollowUp FollowUpfmt.;
Needs to be on the data step (sometime after data PMdataedits; but before the run; that you don't show), not in the proc format. That's the statement that assigns the format to a variable; each dataset (which is defined by a data step) has its own, unique set of variables that can be the same name as other datasets but have different contents and formats.
Also note that you don't have to name the formats after the variables, and don't need three different yes/no formats. You could have done:
proc format;
format ynf
'1'='yes'
'0'='no'
;
run;
And then used
format sitevisit initialtesting followup ynf.;
And that would have covered all three of them with one format. But what you did is legal, it's just more typing than you need!
Variable name is PRC. This is what I have so far. First block to delete negative values. Second block is to delete missing values.
data work.crspselected;
set work.crspraw;
where crspyear=2016;
if (PRC < 0)
then delete;
where ticker = 'SKYW';
run;
data work.crspselected;
set work.crspraw;
where ticker = 'SKYW';
where crspyear=2016;
where=(PRC ne .) ;
run;
Instead of using a function to remove negative and missing values, it can be done more simply when inputting or outputting the data. It can also be done with only one data step:
data work.crspselected;
set work.crspraw(where = (PRC >= 0 & PRC ^= .)); * delete values that are negative and missing;
where crspyear = 2016;
where ticker = 'SKYW';
run;
The section that does it is:
(where = (PRC >= 0 & PRC ^= .))
Which can be done for either the input dataset (work.crspraw) or the output dataset (work.crspselected).
If you must use a function, then the function missing() includes only missing values as per this answer. Hence ^missing() would do the opposite and include only non-missing values. There is not a function for non-negative values. But I think it's easier and quicker to do both together simultaneously without a function.
You don't need more than your first test to remove negative and missing values. SAS treats all 28 missing values (., ._, .A ... .Z) as less than any actual number.
In SAS EG, I have a user defined format
value $MDC
'001' = '77'
'002' = '77
...
'762' = '14'
etc.
My data set has DRG_code string variables with values like '001' and '140'.
I was trying to create a new variable, with the below code.
MDC = put(DRG_code, $MDC.)
Only there are more values for the variable DRG_code in my data set, then specified in the user defined format file, $MDC.
For example, when the data set DRG_Code equals '140' this value does not exist in the user defined format, and for some reason the put statement is returning MDC = '14' (which should only be its value with the DRUG code is '762').
Is there a way to make sure my put statement only returns a value from the user defined format when a corresponding value is present?
Grateful for feedback.
Lori
I've tried using formatting like "length" to have my put statement return 3, which I thought would result in "140" instead of "14" and that didn't work.
value $MDC
'001' = '77'
'002' = '77
...
'762' = '14'
MDC = put(DRG_code, $MDC.)
Formats have a DEFAULT width. If you do not specify a width when using the format then SAS will use the default width. When making a user defined format PROC FORMAT will set the default width to the maximum width of the formatted values. In your example the default width is being set to 2.
You can override that when you use the format.
MDC = put(DRG_code, $MDC3.)
Or you could define the default when you define the format.
value $MDC (default=3)
'001' = '77'
'002' = '77'
...
'762' = '14'
;
You can also set a default value for the unmatched values using the other keyword.
value $MDC (default=3)
'001' = '77'
'002' = '77'
...
'762' = '14'
other = 'UNK'
;
You can even nest a call to another format for the unmatched values (or any target format). In which case you do not need to specify the default width since the width on the nested format will be used when defining the default width.
value $MDC
'001' = '77'
'002' = '77'
...
'762' = '14'
other = [$3.]
;
I presume all the value mappings were $2 because that is what is used for an 'unfound' source value. In order to ensure the length of 'unfound' values, make sure one of the formatted values has trailing spaces filling out to length of longest unfound value.
value $MDC
'001' = '77 ' /* 7 characters, presuming no DRG_code exceeds 7 characters */
'002' = '77'
'762 = '14'
You can also fix this by specifying a length to use when applying the format, e.g.
proc format;
value $MDC
'001' = '77'
'762' = '14'
;
run;
data _null_;
do var = '001','140','762';
var_formatted = quote(put(var,$MDC3.));
put var= var_formatted=;
end;
run;
Output:
var=001 var_formatted="77 "
var=140 var_formatted="140"
var=762 var_formatted="14 "
N.B. both this solution and Richard's will result in trailing whitespace being added to formatted values, as you can see from the quotes.
Here I propose a slight modification to user667489's solution so that:
you don't need to specify the length of the format every time you use it (using the default option of the value statement when defining the format)
the resulting formatted value doesn't have trailing blanks (using the trim() function on the output resulting from applying the format)
i.e.
proc format;
value $MDC(default=3)
'001' = '77'
'002' = '77'
'762' = '14'
;
run;
data _null_;
do var = '001', '140', '762';
var_formatted = quote(trim(put(var, $MDC.)));
put var= var_formatted=;
end;
run;
which gives the following output:
var=001 var_formatted="77"
var=140 var_formatted="140"
var=762 var_formatted="14"
I have a dataset with multiple subgroups (variable economist) and dates (variable temps99).
I want to run a tabsplit command that does not accept bysort or by prefixes. So I created a macro to apply my tabsplit command to each of my subgroups within my data.
For example:
levelsof economist, local(liste)
foreach gars of local liste {
display "`gars'"
tabsplit SubjectCategory if economist=="`gars'", p(;) sort
return list
replace nbcateco = r(r) if economist == "`gars'"
}
For each subgroup, Stata runs the tabsplit command and I use the variable nbcateco to store count results.
I did the same for the date so I can have the evolution of r(r) over time:
levelsof temps99, local(liste23)
foreach time of local liste23 {
display "`time'"
tabsplit SubjectCategory if temps99 == "`time'", p(;) sort
return list
replace nbcattime = r(r) if temps99 == "`time'"
}
Now I want to do it on each subgroups economist by date temps99. I tried multiple combination but I am not very good with macros (yet?).
What I want is to be able to have my r(r) for each of my subgroups over time.
Here's a solution that shows how to calculate the number of distinct publication categories within each by-group. This uses runby (from SSC). runby loops over each by-group, each time replacing the data in memory with the data from the current by-group. For each by-group, the commands contained in the user's program are executed. Whatever is left in memory when the user's program terminates is considered results and accumulates. Once all the groups have been processed, these results replace the data in memory.
I used the verbose option because I wanted to present the results for each by-group using nice formatting. The derivation of the list of distinct categories is done by splitting each list, converting to a long layout, and reducing to one observation per distinct value. The distinct_categories program generates one variable that contains the final count of distinct categories for the by-group.
* create a demontration dataset
* ------------------------------------------------------------------------------
clear all
set seed 12345
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 economist
"Carmen M. Reinhart"
"Janet Currie"
"Asli Demirguc-Kunt"
"Esther Duflo"
"Marianne Bertrand"
"Claudia Goldin"
"Bronwyn Hughes Hall"
"Serena Ng"
"Anne Case"
"Valerie Ann Ramey"
end
expand 20
bysort economist: gen temps99 = 1998 + _n
gen pubs = runiformint(1,10)
expand pubs
sort economist temps99
gen pubid = _n
local nep NEP-AGR NEP-CBA NEP-COM NEP-DEV NEP-DGE NEP-ECM NEP-EEC NEP-ENE ///
NEP-ENV NEP-HIS NEP-INO NEP-INT NEP-LAB NEP-MAC NEP-MIC NEP-MON ///
NEP-PBE NEP-TRA NEP-URE
gen SubjectCategory = ""
forvalues i=1/19 {
replace SubjectCategory = SubjectCategory + " " + word("`nep'",`i') ///
if runiform() < .1
}
replace SubjectCategory = subinstr(trim(SubjectCategory)," ",";",.)
leftalign // from SSC
* ------------------------------------------------------------------------------
program distinct_categories
dis _n _n _dup(80) "-"
dis as txt "fille = " as res economist[1] as txt _col(68) " temps = " as res temps99[1]
// if there are no subjects for the group, exit now to avoid a no obs error
qui count if !mi(trim(SubjectCategory))
if r(N) == 0 exit
// split categories, reshape to a long layout, and reduce to unique values
preserve
keep pubid SubjectCategory
quietly {
split SubjectCategory, parse(;) gen(cat)
reshape long cat, i(pubid)
bysort cat: keep if _n == 1
drop if mi(cat)
}
// show results and generate the wanted variable
list cat
local distinct = _N
dis _n as txt "distinct = " as res `distinct'
restore
gen wanted = `distinct'
end
runby distinct_categories, by(economist temps99) verbose
This is an example of the XY problem, I think. See http://xyproblem.info/
tabsplit is a command in the package tab_chi from SSC. I have no negative feelings about it, as I wrote it, but it seems quite unnecessary here.
You want to count categories in a string variable: semi-colons are your separators. So count semi-colons and add 1.
local SC SubjectCategory
gen NCategory = 1 + length(`SC') - length(subinstr(`SC', ";", "", .))
Then (e.g.) table or tabstat will let you explore further by groups of interest.
To see the counting idea, consider 3 categories with 2 semi-colons.
. display length("frog;toad;newt")
14
. display length(subinstr("frog;toad;newt", ";", "", .))
12
If we replace each semi-colon with an empty string, the change in length is the number of semi-colons deleted. Note that we don't have to change the variable to do this. Then add 1. See also this paper.
That said, a way to extend your approach might be
egen class = group(economist temps99), label
su class, meanonly
local nclass = r(N)
gen result = .
forval i = 1/`nclass' {
di "`: label (class) `i''"
tabsplit SubjectCategory if class == `i', p(;) sort
return list
replace result = r(r) if class == `i'
}
Using statsby would be even better. See also this FAQ.
Essentially, what I would like to do is use the min/max function while altering a table. I am altering a table, adding a column, and then having that column set to a combination of a min/max function. In SAS, however, you can't use summary functions. Is there a way to go around this?
There are many more inputs but for the sake of clarity, a condensed version is below! Thanks!
%let variable = 42
alter table X add Z float;
update X
set C = min(max(0,500 - %sysevalf(variable)),0);
First, let's remove the %sysevalf(), they are not needed and format for readability
alter table claims.simulation add Paid_Claims_NoISL float;
update claims.simulation
set Paid_Claims_NoISL
= min(
max(0
, Allowed_Claims -&OOPM
, min(Allowed_Claims
,&Min_Paid+ max(Allowed_Claims - &Deductible * &COINS
,0
)
)
, &Ind_Cap_Claim
)
);
Notice that the first min() only has 1 argument. That is causing your ERROR. SAS thinks that because it only has 1 input, you want to summarize a column, which is not allowed in an update.
Just take that out and it should work:
alter table claims.simulation add Paid_Claims_NoISL float;
update claims.simulation
set Paid_Claims_NoISL
= max(0
, Allowed_Claims -&OOPM
, min(Allowed_Claims
,&Min_Paid+ max(Allowed_Claims - &Deductible * &COINS
,0
)
)
, &Ind_Cap_Claim
);
To reference the value of a macro variable you need to use &.
%let mvar = 42;
proc sql;
update X set C = min(max(0,500 - &mvar),0);
quit;
Note there is no need to use the macro function %SYSEVALF() since SAS can more easily handle the case when the value &mvar is an expression than the macro processor can.
%let mvar = 500 - 42;
proc sql;
update X set C = min(max(0,&mvar),0);
quit;