SAS Change values of Variables without 3 Data Steps - sas

Often when I am coding in SAS, I need to change values of variables, like turning a character into a numeric, or rounding values. Because of how SAS works as far as I know, I often have to do it in three steps, like so:
data change;
set raw;
words = put(zipcode, $5.);
run;
data drop;
set change;
drop zipcode;
run;
data rename;
set drop;
rename words = "zipcode";
run;
Is there a way to do something like this in a single data or proc step, rather than having to type out three? And not just for a variable type conversion, but for things like the ROUND statement as well.
Thanks in advance.

This is where dataset options are a huge advantage. They work in the DATA Step, SQL, and almost every PROC where a dataset is referenced.
You can do all of this in a single data step multiple ways.
1. An output dataset option
data change(rename=(words = zipcode) );
set raw;
words = put(zipcode, $5.);
drop zipcode;
run;
Here's what's happening:
words is created in the dataset
At the end of the data step, zipcode is dropped.
As the very last step after zipcode is dropped, words is renamed to zipcode
This is called an output dataset option. It's the last thing that happens before the dataset is finally written.
2. An input dataset option
data change;
set raw(rename=(zipcode = _zipcode) );
words = put(_zipcode, $5.);
drop _zipcode;
run;
Here's what's happening:
Before raw is read, zipcode is renamed to _zipcode
words is created
_zipcode is dropped from the dataset
Input/output dataset options are very powerful. You can create special where clauses, indices, compress data, and much more using them.
You can view all of the available dataset options here:
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/ledsoptsref/p1pczmnhbq4axpn1l15s9mk6mobp.htm

There is no need to do that in three steps. One step is enough.
data rename;
set raw;
words = put(zipcode, $5.);
drop zipcode;
rename words = zipcode;
run;

Related

using by group processing First. and Last

I just start learning sas and would like some help with understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagrate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step created 2 temporary variables for each by variable (first.varibale/last.variable) and the values are either 1 or 0, but what does first.dept and last.dept exactly do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION:
Use care when you use PROC SORT without OUT=.
Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default Without OUT=, PROC SORT overwrites the original data set.
Tips With in-database sorts, the output data set cannot refer to the input table on the DBMS.
You can use data set options with OUT=.
See SAS Data Set Options: Reference
Example Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that indicates the first observation of a specific BY group. So when you encounter the first record for a department it is identified. Last.DEPT is the last record for that specific department. It means the next record would the first record for a different department.
It sets PAYROLL to 0 at the first of each record. Since you have if last.dept; that means that only the last record for each department is outputted. This code is not intuitive - it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY but I assume they were trying to avoid having two passes of the data. Though if you're not sorting it may be just as fast anyways.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;

Adding a column is SAS using MODIFY (no sql)

I'm new to SAS and have some problems with adding a column to existing data set in SAS using MODIFY statement (without proc sql).
Let's say I have data like this
id name salary perks
1 John 2000 50
2 Mary 3000 120
What I need to get is a new column with the sum of salary and perks.
I tried to do it this way
data data1;
modify data1;
money=salary+perks;
run;
but apparently it doesn't work.
I would be grateful for any help!
As #Tom mentioned you use SET to access the dataset.
I generally don't recommend programming this way with the same name in set and data statements, especially as you're learning SAS. This is because it's harder to detect errors, since once run and encounter an error, you destroy your original dataset and have to recreate it before you start again.
If you want to work step by step, consider intermediary datasets and then clean up after you're done by using proc datasets to delete any unnecessary intermediary datasets. Use a naming conventions to be able to drop them all at once, i.e. data1, data2, data3 can be referenced as data1-data3 or data:.
data data2;
set data1;
money = salary + perks;
run;
You do now have two datasets but it's easy to drop datasets later on and you can now run your code in sections rather than running all at once.
Here's how you would drop intermediary datasets
proc datasets library=work nodetails holist;
delete data1-data3;
run;quit;
You can't add a column to an existing dataset. You can make a new dataset with the same name.
data data1;
set data1;
money=salary+perks;
run;
SAS will build it as a new physical file (with a temporary name) and when the step finishes without error it deletes the original and renames the new one.
If you want to use a data set you do it like this:
data dataset;
set dataset;
format new_column $12;
new_column = 'xxx';
run;
Or use Proc SQL and ALTER TABLE.
proc sql;
alter table dataset
add new_column char(8) format = $12.
;
quit;

Export/Import Attributes of a SAS dataset

I am working with multiple waves of survey data. I have finished defining formats and labels for the first wave of data.
The second wave of data will be different, but the codes, labels, formats, and variable names will all be the same. I do not want to define all these attributes again...it seems like there should be a way to export the PROC CONTENTS information for one dataset and import it into another dataset. Is there a way to do this?
The closest thing I've found is PROC CPORT but I am totally confused by it and cannot get it to run.
(Just to be clear I'll ask the question another way as well...)
When you run PROC CONTENTS, SAS tells you what format, labels, etc. it is using for each variable in the dataset.
I have a second dataset with the exact same variable names. I would like to use the variable attributes from the first dataset on the variables in the second dataset. Is there any way to do this?
Thanks!
So you have a MODEL dataset and a HAVE dataset, both with data in them. You want to create WANT dataset which has data from HAVE, with attributes of MODEL (formats, labels, and variable lengths). You can do this like:
data WANT ;
if 0 then set MODEL ;
set HAVE ;
run ;
This works because when the DATA step compiles, SAS builds the Program Data Vector (PDV) which defines variable attributes. Even though the SET MODEL never executes (because 0 is not true), all of the variables in MODEL are created in the PDV when the step compiles.
Importantly, note that if there are corresponding variables with different lengths, the length from MODEL will determine the length of the variable in WANT. So if HAVE has a variable that is longer than the same-named variable in MODEL, it may be truncated. Options VARLENCHK determines whether or not SAS throws a warning/error if this happens.
That assumes there are no formats/labels on the HAVE dataset. If there is a variable in HAVE that has a format/label, and the corresponding variable in MODEL does not have a format/label, the format/label from HAVE will be applied to WANT.
Sample code below.
data model;
set sashelp.class;
length FavoriteColor $3;
FavoriteColor="Red";
dob=today();
label
dob='BirthDate'
;
format
dob mmddyy10.
;
run;
data have;
set sashelp.class;
length FavoriteColor $10;
dob=today()-1;
FavoriteColor="Orange";
label
Name="HaveLabel"
dob="HaveLabel"
;
format
Name $1.
dob comma.
;
run;
options varlenchk=warn;
data want;
if 0 then set model;
set have;
run;
I'd create an empty dataset based on the existing one, and then use proc append to append the contents to it.
Create some sample data for the second round of data:
data new_data;
age = 10;
run;
Create an empty dataset based on the original data:
proc sql noprint;
create table want like sashelp.class;
quit;
Append the data into the empty dataset, retaining the details from the original:
proc append base=want data=new_data force nowarn;
run;
Note that I've used the force and nowarn options on proc append. This will ensure the data is appended even if differences are found between the two datasets being used. This is expected if you have, for example, format differences. It will also hide things like if columns exist in the new table that aren't in the old table etc. So be careful that this is doing what you want it to. If the behaviour is undesirable, consider using a datastep to append instead (and list the want dataset first).
Welcome to the stack.
If you want to copy the properties of the table without the data within it, you could use PROC SQL or data step with zero rows read in.
This examples copies all information about the SASHELP.CLASS dataset into a brand new dataset. All formats, attributes, labels, the whole thing is copies over. If you want to only copy some of the columns, specify them in select clause instead of asterix.
PROC SQL outobs=0;
CREATE TABLE WANT as SELECT * FROM SASHELP.CLASS;
QUIT;
Regards,
Vasilij

SAS Transpose Comma Separated Field

This is a follow-up to an earlier question of mine.
Transposing Comma-delimited field
The answer I got worked for the specific case, but now I have a much larger dataset, so reading it in a datalines statement is not an option. I have a dataset similar to the one created by this process:
data MAIN;
input ID STATUS STATE $;
cards;
123 7 AL,NC,SC,NY
456 6 AL,NC
789 7 ALL
;
run;
There are two problems here:
1: I need a separate row for each state in the STATE column
2: Notice the third observation says 'ALL'. I need to replace that with a list of the specific states, which I can get from a separate dataset (below).
data STATES;
input STATE $;
cards;
AL
NC
SC
NY
TX
;
run;
So, here is the process I am attempting that doesn't seem to be working.
First, I create a list of the STATES needed for the imputation, and a count of said states.
proc sql;
select distinct STATE into :all_states separated by ','
from STATES;
select count(distinct STATE) into :count_states
from STATES;
quit;
Second, I try to impute that list where the 'ALL' value appears for STATE. This is where the first error appears. How can I ensure that the variable STATE is long enough for the new value? Also, how do I handle the commas?
data x_MAIN;
set MAIN;
if STATE='ALL' then STATE="&all_states.";
run;
Finally, I use a SCAN function to read in one state at a time. I'm also getting an error here, but I think fixing the above part may solve it.
data x_MAIN_mod;
set x_MAIN;
array state(&count_states.) state:;
do i=1 to dim(state);
state(i) = scan(STATE,i,',');
end;
run;
Thanks in advance for the help!
Looks like you are almost there. Try this on the last Data Step.
data x_MAIN_mod;
set x_MAIN;
format out_state $2.;
nstate = countw(state,",");
do i=1 to nstate;
out_state = scan(state,i,",");
output;
end;
run;
Do you have to actually have two steps like that? You can use a 'big number' in a temporary variable and not have much effect on things, if you don't have the intermediate dataset.
data x_MAIN;
length state_temp $150;
set MAIN;
if STATE='ALL' then STATE_temp="&all_states.";
else STATE_temp=STATE;
array state(&count_states.) state:;
do i=1 to dim(state);
state(i) = scan(STATE,i,',');
end;
drop STATE_temp;
run;
If you actually do need the STATE, then honestly I'd go with the big number (=50*3, so not all that big) and then add OPTIONS COMPRESS=CHAR; which will (give or take) turn your CHAR fields into VARCHAR (at the cost of a tiny bit of CPU time, but usually far less than the disk read/write time saved).

SAS drop cloned observation

I have a dataset structured with an ID and two other variables.
The id is not unique, it appears in the dataset more than 1 time (a patient could receive more than one clinical treatment).How can I drop the entire observation (the entire line) only if it is a perfect clone of a previous observation (based on the other two variable values)? I don't want to use an insanely long if statement.
Thanks.
proc sql;
select distinct * from olddata;
quit;
Sounds like an easy SQL fix. The select distinct option will remove any completely duplicate rows in a dataset if you select all columns.
If you specifically want to identify if two consecutive lines are identical (but are not looking to match identical lines separated by other lines), you can use notsorted on a by statement and then first and last variables.
data want;
set have;
by id var1 var2 notsorted;
if first.var2;
run;
That will keep the first record for any group of identical id/var1/var2, so long as they're consecutive on the dataset. Of course if you sort the dataset by id var1 var2 first this will always remove the duplicates, but not sorted this still works for removing consecutive pairs (or more) that are collocated.
I prefer #JJFord's answer, but for the sake of completeness, this can also be done using the nodupe option in proc sort:
proc sort data=mydata nodupe;
by id;
run;
What you choose as the by variable doesn't really matter here. The important bit is just to specify the nodupe option.