Modify Data Step with Duplicate Values? - sas

I have a dataset that has two columns of data. I need to rename some of the values, and the code below is working great.
data NAS_STG;
modify NAS_STG;
if Balance = "Test1" then Balance ="Test Account 1";
run;
However there is one issue. The datasheet this was imported from has some duplicate observations. They aren't actually the same, just in different sections with the same observation name.
Is there a way to change the first occurrence, then after the first occurrence, when it finds the duplicate I can name it something else? I need to ensure the duplicate sequential observations are not named the same.
e.g.
data NAS_STG;
modify NAS_STG;
if Balance = "Workout Loan" then Balance ="Workout"; *** FIRST INSTANCE***
if Balance = "Workout Loan" then Balance ="Int_Workout"; *** SECOND INSTANCE***
run;

No need for MODIFY for this. Use a simple SET statement instead. It is probably best to create a new dataset instead of making changes to the existing one. That way, if something goes wrong with the code, you don't have to re-import the XLSX file.
So if the data is not already sorted then sort it.
proc sort data=NAS_STG; by balance; run;
Now use the fact that it is sorted to let you change only one of the observations.
data NAS_STG_fixed;
set NAS_STG;
by balance;
if first.balance and balance = "Test1" then Balance ="Test Account 1";
run;
Because the values of BALANCE have changed, the new dataset may no longer be in sorted order. You might need to re-sort it by BALANCE if you are going to use that variable to combine with some other data.
proc sort data=NAS_STG_fixed;
by balance;
run;
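If the rows have to stay in their original order (sorting by BALANCE rearranges them), a counter can tell the first occurrence from later ones. This is only a sketch: it assumes "first" means first in the current row order, and it uses the two replacement names from the question. The helper variable _seen is a made-up name.

```sas
data NAS_STG_fixed;
  set NAS_STG;
  if Balance = "Workout Loan" then do;
    _seen + 1;                            /* sum statement: starts at 0 and is retained across rows */
    if _seen = 1 then Balance = "Workout";  /* first occurrence */
    else Balance = "Int_Workout";           /* every later occurrence */
  end;
  drop _seen;                             /* hypothetical helper, not kept in the output */
run;
```

Both new values are no longer than "Workout Loan", so they fit in the existing variable length without truncation.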

Related

SAS Change values of Variables without 3 Data Steps

Often when I am coding in SAS, I need to change values of variables, like turning a character into a numeric, or rounding values. Because of how SAS works as far as I know, I often have to do it in three steps, like so:
data change;
set raw;
words = put(zipcode, $5.);
run;
data drop;
set change;
drop zipcode;
run;
data rename;
set drop;
rename words = "zipcode";
run;
Is there a way to do something like this in a single data or proc step, rather than having to type out three? And not just for a variable type conversion, but for things like the ROUND statement as well.
Thanks in advance.
This is where dataset options are a huge advantage. They work in the DATA Step, SQL, and almost every PROC where a dataset is referenced.
You can do all of this in a single data step multiple ways.
1. An output dataset option
data change(rename=(words = zipcode) );
set raw;
words = put(zipcode, $5.);
drop zipcode;
run;
Here's what's happening:
words is created in the dataset
At the end of the data step, zipcode is dropped.
As the very last step after zipcode is dropped, words is renamed to zipcode
This is called an output dataset option. It's the last thing that happens before the dataset is finally written.
2. An input dataset option
data change;
set raw(rename=(zipcode = _zipcode) );
words = put(_zipcode, $5.);
drop _zipcode;
run;
Here's what's happening:
Before raw is read, zipcode is renamed to _zipcode
words is created
_zipcode is dropped from the dataset
Input/output dataset options are very powerful. You can create special where clauses, indices, compress data, and much more using them.
You can view all of the available dataset options here:
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/ledsoptsref/p1pczmnhbq4axpn1l15s9mk6mobp.htm
There is no need to do that in three steps. One step is enough.
data rename;
set raw;
words = put(zipcode, $5.);
drop zipcode;
rename words = zipcode;
run;
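The same idea covers the other cases mentioned in the question. As a sketch only, assume zipcode is stored as character digits in raw, and that raw also has a numeric variable called amount (a hypothetical name, not from the question). Rounding needs no rename at all because the type does not change. A type conversion does need one, because a variable's type cannot change once it is set.

```sas
data change;
  set raw(rename=(zipcode = _zipcode));  /* free up the name zipcode */
  zipcode = input(_zipcode, 5.);         /* character to numeric, same name as before */
  amount = round(amount, 0.01);          /* rounding in place needs no rename */
  drop _zipcode;
run;
```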

Variable has never been referenced [SAS]

I have a dataset containing information by an account number. I'm trying to add a new variable, called product_type, populated with the same value for every record. It is within a SAS macro.
data CC_database;
set cc_base_v2 (keep=accnum date product_type);
product_type="CC";
where date>=%sysevalf("&start_date."d) and file_date<=%sysevalf("&end_date."d);
run;
However, I keep getting the error "The variable product_type in the DROP, KEEP, or RENAME list has never been referenced" and the output dataset only shows a blank column called product_type. What is happening here?
You are using the KEEP= dataset option on the input dataset. So the error is saying PRODUCT_TYPE does not exist in the dataset CC_BASE_V2.
If you want to control what variables are written by the data step just use a KEEP statement.
keep accnum date product_type;
In your case you could use the KEEP= dataset option, but only list the variables that are coming from CC_BASE_V2.
data CC_database;
set cc_base_v2(keep=accnum date file_date);
where date>= "&start_date."d and file_date<="&end_date."d;
product_type="CC";
drop file_date;
run;
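Another variant, shown here as a sketch, puts KEEP= on the output dataset instead. An output dataset option is applied when the dataset is written, so it can list variables created inside the step. This assumes the macro variables start_date and end_date hold values that are valid in a date literal, such as 01JAN2020.

```sas
data CC_database(keep=accnum date product_type);  /* applied when CC_database is written */
  set cc_base_v2;                                 /* all input variables are available to the step */
  where date >= "&start_date."d and file_date <= "&end_date."d;
  product_type = "CC";
run;
```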

using by group processing First. and Last

I just started learning SAS and would like some help understanding the following chunk of code. The following program computes the annual payroll by department.
proc sort data = company.usa out=work.temp;
by dept;
run;
data company.budget(keep=dept payroll);
set work.temp;
by dept;
if wagecat ='S' then yearly = wagerate *12;
else if wagecat = 'H' then yearly = wagerate *2000;
if first.dept then payroll=0;
payroll+yearly;
if last.dept;
run;
Questions:
What does out = work.temp do in the first line of this code?
I understand the data step created 2 temporary variables for each by variable (first.variable/last.variable) and the values are either 1 or 0, but what exactly do first.dept and last.dept do here in the code?
Why do we need payroll=0 after first.dept in the second to the last line?
This code takes the data for salaries and calculates the payroll amount for each department for a year, assuming salary is the same for all 12 months and that an hourly worker works 2000 hours.
It creates a copy of the data set which is sorted and stored in the work library. RTM.
From the docs
OUT= SAS-data-set
names the output data set. If SAS-data-set does not exist, then PROC SORT creates it.
CAUTION:
Use care when you use PROC SORT without OUT=.
Without the OUT= option, PROC SORT replaces the original data set with the sorted observations when the procedure executes without errors.
Default: Without OUT=, PROC SORT overwrites the original data set.
Tips: With in-database sorts, the output data set cannot refer to the input table on the DBMS.
You can use data set options with OUT=.
See: SAS Data Set Options: Reference
Example: Sorting by the Values of Multiple Variables
First.DEPT is an indicator variable that flags the first observation of a specific BY group. So when you encounter the first record for a department, it is identified. Last.DEPT flags the last record for that specific department. It means the next record would be the first record for a different department.
It resets PAYROLL to 0 at the first record of each department. Because of the subsetting statement if last.dept; only the last record for each department is output, and by then PAYROLL holds the department total. This code is not intuitive: it's a manual way to sum the wages for people in each department. The common way would be to use a summary procedure, such as MEANS/SUMMARY, but I assume they were trying to avoid a second pass over the data. Though if you're not sorting, it may be just as fast anyway.
Again, RTM here. The SAS documentation is quite thorough on these beginner topics.
Here's an alternative method that should generate the exact same results but is more intuitive IMO.
data temp;
set company.usa;
if wagecat='S' then factor=12; *salary in months;
else if wagecat='H' then factor=2000; *salary in hours;
run;
proc means data=temp noprint NWAY;
class dept;
var wagerate;
weight factor;
output out=company.budget sum(wagerate)=payroll;
run;
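To watch FIRST.DEPT, LAST.DEPT and the running total change row by row, one option is to run the original logic in a DATA _NULL_ step and write the values to the log. This is a sketch that assumes work.temp already exists from the PROC SORT in the question. It drops the subsetting if last.dept; so that every row is shown.

```sas
data _null_;                       /* no output dataset, log only */
  set work.temp;
  by dept;
  if wagecat = 'S' then yearly = wagerate * 12;
  else if wagecat = 'H' then yearly = wagerate * 2000;
  if first.dept then payroll = 0;  /* reset at the start of each department */
  payroll + yearly;                /* sum statement: payroll is retained across rows */
  put dept= first.dept= last.dept= yearly= payroll=;
run;
```

The rows where last.dept=1 are the ones the original step keeps.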

creating two data sets in one data step in sas

This should be very simple, but somehow I confuse myself.
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
if in_employee and in_hours then output in_both;
else if in_employee and not in_hours then output missing_name;
run;
I have two questions:
(1): For the first statement "missing_name(drop = name)", I understand that, it means keep all the data except the column whose head is name. But keep which data here? What is the input?
(2): I know we can create two datasets within one data step, but that means we should use "data in_both missing_name", instead of "data in_both", right?
Many thanks for your time and attention. I appreciate your help.
(1) The DROP= option refers to dropping variables from the dataset MISSING_NAME. With no drop= or keep= option, all variables that exist in EMPLOYEE or HOURS would be written to MISSING_NAME. You can run PROC CONTENTS on the four datasets to see which variables are included in each.
(2) As written, your code will output two datasets, IN_BOTH and MISSING_NAME. As @Tom just commented, your current DATA statement already lists both datasets, because the semicolon ends the statement, not the white space/carriage return.
The DATA statement determines which datasets will be created by the data step. The dataset options, like the DROP= option in your example, can be used to control which of the variables are written into those datasets.
It is the OUTPUT statement that is deciding which observations will be written. So in your example your IF/THEN/ELSE logic is determining which output statements to execute.
Using your posted code:
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
run;
Inputs - employee & hours
Outputs - in_both & missing_name
In this example the output missing_name has the column NAME dropped.
The best way to view what's going on if the line breaks are confusing is to look for the semi-colon. At first glance I got a little confused too!
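A small self-contained run can make this concrete. The sample rows below are invented for illustration only. The merge step is the one from the question.

```sas
/* Hypothetical sample data: IDs and names are made up */
data employee;
  input ID name $;
  datalines;
1 Ann
2 Bob
3 Cy
;
run;

data hours;
  input ID hours;
  datalines;
1 40
3 35
4 20
;
run;

data in_both
     missing_name (drop = name);   /* one DATA statement, two output datasets */
  merge employee (in = in_employee)
        hours    (in = in_hours);
  by ID;
  if in_employee and in_hours then output in_both;
  else if in_employee and not in_hours then output missing_name;
run;
```

With this sample, IN_BOTH gets IDs 1 and 3 with all variables, MISSING_NAME gets ID 2 without the NAME column, and ID 4 is written to neither dataset because no OUTPUT statement runs for it.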

Keeping only the duplicates

I'm trying to keep only the duplicate results for one column in a table. This is what I have.
proc sql;
create table DUPLICATES as
select Address, count(*) as count
from TEST_TABLE
group by Address
having COUNT gt 1
;
quit;
Is there any easier way to do this or an alternative I didn't think of? It seems goofy that I then have to re-join it with the original table to get my answer.
proc sort data=TEST_TABLE;
by Address;
run;
data DUPLICATES;
set TEST_TABLE;
by Address;
if not (first.Address and last.Address) then output;
run;
Using proc sort with nodupkey and dupout will dedupe the data and give you an "out" dataset with duplicate records from the original dataset, but the "out" dataset does not include EVERY record with the ID variable - it gives you the 2nd, 3rd, 4th...Nth. So you aren't comparing all the duplicate occurrences of the ID variable when you use this method. It's great when you know what you want to remove and define enough by variables to limit this precisely, or if you know that your records with duplicate IDs are identical in every way and you just want them removed.
When there are duplicates in a raw file I receive, I like to compare all records where ID has more than one occurrence.
proc sort data=test nouniquekeys
uniqueout=singles
out=dups;
by ID;
run;
nouniquekeys deletes unique observations from the "out" DS
uniqueout=dsname stores unique observations
out=dsname stores remaining observations
Again, this method is great for working with messy raw data, and for debugging if your code might have produced duplicates.
That's easy using PROC SORT:
proc sort data=TEST_TABLE nodupkey dupout=dups;
by Address;
run;
Refer to this documentation for further information
select field,count(field) from table
group by field having count(field) > 1
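To get every row of each duplicated Address in one query, without a separate re-join, one option is to select all columns and let PROC SQL do the join for you. This sketch relies on SAS-specific behavior: when the SELECT list contains columns that are not in the GROUP BY, PROC SQL remerges the group count back onto the detail rows and writes a note about it to the log. Other SQL dialects generally reject this query.

```sas
proc sql;
  create table DUPLICATES as
  select *
  from TEST_TABLE
  group by Address
  having count(*) > 1;   /* keep all rows of any Address that occurs more than once */
quit;
```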