Base SAS 9.2: SUBSTR function to classify ZIP codes into regions

I have a variable full of ZIP code observations and I want to sort those ZIP codes into four regions based on the first three digits of the code.
For example, all ZIP codes that start with 350, 351, or 352 should be grouped into a region called "central." Those that start with 362, 368, 360 or 361 should be in a region called "east." Etc.
How do I get base SAS to look at only the first three digits of the ZIP code variable?
What is the best way to associate those digits with a new variable called "region?"
Here's the code I have so far:
data work.temp;
set library.dataset;
a= substr (Zip_Code,1,3);
put a;
keep Zip_Code a;
run;
proc print data=work.temp;
run;
The column a is blank in my proc print results, however.
Thanks for your help

As @Joe explains, this happens because zip_code is defined as a numeric variable. I have seen this at a client site where ZIP codes were stored as numeric, and it led to various data issues. You should define the ZIP code as a character variable; you can then assign regions with IF statements, a reference table, or PROC FORMAT. Below are examples of the IF statement and the reference table. I find the reference table method very robust.
data have;
input zip_code $;
datalines;
35099
35167
35245
36278
36899
36167
;
By if statement
data work.temp;
set have;
if substr(Zip_Code,1,3) in ('350', '351', '352') then Region = 'EAST';
else if substr(Zip_Code,1,3) in ('362', '368', '361') then Region = 'WEST';
run;
By use of reference table
data reference;
input code $ Region $;
datalines;
350 EAST
351 EAST
352 EAST
362 WEST
368 WEST
361 WEST
;
proc sql;
select a.*, b.region from have a
left join
reference b
on substr(a.zip_code,1,3) = b.code;
quit;
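The PROC FORMAT option mentioned at the top of this answer could look something like the sketch below. The format name $zipreg and the "unknown" fallback label are made up for illustration, and the region labels follow the wording of the question rather than the EAST/WEST labels used above. It assumes the same character zip_code variable as the have dataset.

```sas
/* Hypothetical character format that maps a 3-digit prefix to a region */
proc format;
  value $zipreg
    '350', '351', '352'        = 'central'
    '360', '361', '362', '368' = 'east'
    other                      = 'unknown';   /* assumed fallback label */
run;

data work.temp;
  set have;                    /* zip_code must be character here */
  length Region $8;
  /* look up the first three characters in the format */
  Region = put(substr(zip_code, 1, 3), $zipreg.);
run;
```

The advantage over a chain of IF statements is that the mapping lives in one place and can be reused in other steps.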

If a is blank, then your zip_code variable is almost certainly numeric. You probably have a note about numeric to character conversion.
SAS will happily let you mix numeric and character values in most instances, but it won't always give the behavior you expect. In this case, it's probably converting the number with the BEST12. format, which right-aligns the result in 12 characters, so 60601 becomes seven blanks followed by "60601". substr(that,1,3) therefore returns three blanks, of course.
Zip code ideally would be stored in a character variable as it's an identifier, but if it's not for whatever reason, you can do this:
a = substr(put(zip_code,z5.),1,3);
The Zw.d format is correct since you want Massachusetts to be "02101" and not "2101 ".
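A small sketch to see the difference in the log. The variable names zip_num, implicit and explicit are made up for this illustration. If the explanation above holds, implicit should come out blank (with a numeric-to-character conversion note in the log) and explicit should hold the zero-padded prefix.

```sas
data _null_;
  zip_num  = 2101;                              /* numeric ZIP, leading zero already lost */
  implicit = substr(zip_num, 1, 3);             /* automatic conversion, right-aligned */
  explicit = substr(put(zip_num, z5.), 1, 3);   /* zero-padded to five digits first */
  put implicit= explicit=;                      /* compare the two in the log */
run;
```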

Related

SAS character and numeric change with set statement

I am working to merge two data sets and get the following error:
Variable DOB has been defined as both character and numeric.
Here is my code. I know I need a set statement to change the character to numeric. I was thinking:
DATA Merged1;
SET Aug21 Aug22;
RUN;
set (rename=(DOB=DOBnum));
length DOB $ 10.;
DOB= put(DOBnum,f10. -L);
drop DOBnum;
Would this be placed before my Set statement to merge to Aug 21 Aug 22?
Thank you!
I tried to run the code but it would not merge; I'm unsure where the SET statement for DOB would go.
You do not need the second SET statement. You need to add the RENAME= dataset option to the dataset where it is mentioned in the first SET statement.
So something like:
DATA BOTH;
SET Aug21 Aug22(in=in2 rename=(DOB=DOBnum));
if in2 then DOB= put(DOBnum,f10. -L);
drop DOBnum;
RUN;
To get a more detailed answer provide more details about the variables and the types of values they contain. For example if DOB means Date of Birth then it does not make much sense to use the F format. If DOB should be an actual DATE then it should be numeric and not character. And if the version that is numeric has actual date values then converting them to text using the F format is going to generate strings that will be confusing for humans.
If you're a beginner, I recommend doing this in two data steps so you can trace the work:
Convert dob from character to numeric
Append the two datasets together (assuming you're stacking the data sets)
Use a format to control how the date is displayed
*convert character to numeric SAS date;
data aug21_convert2num;
set aug21(rename=(dob=dobchar));
dob = input(dobchar, anydtdte.);
drop dobchar;
run;
*append the two data sets;
data want;
set aug21_convert2num aug22;
format dob yymmdd10.;
run;

How to split a column into multiple rows in SAS

I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
set orig_dataset;
T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
do i = 1 to countw(T_BLOB_TRANS,"|");
T_BLOB_SUB = scan(T_BLOB_TRANS,i,"|");
output;
end;
run;
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line);
newvar = scan(line,i);
output;
end;
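Putting that together for the data in the question, a minimal sketch might look like this. The have dataset and the piece variable are made up for the example, and it assumes no value ever contains an equals sign. Passing '=' as the delimiter matters here: with the default delimiters, blanks would also split the street names.

```sas
/* Sample row modelled on the question; '|' only separates the two input fields */
data have;
  length system_id $8 t_blob $200;
  infile datalines dlm='|' truncover;
  input system_id $ t_blob $;
  datalines;
GID_1|2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
;

data want (keep=system_id piece);
  set have;
  length piece $200;
  /* runs of '=' count as a single delimiter by default */
  do i = 1 to countw(t_blob, '=');
    piece = scan(t_blob, i, '=');
    output;                      /* one output row per piece */
  end;
run;
```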

enter column in a dataset to an array

I have 33 different datasets with one column and all share the same column name/variable name;
net_worth
I want to load the values into arrays and use them in a data step. But the array that I use should depend on the BY groups in the data step (country by city). There are a total of 33 datasets and 33 groups (country by city). Each dataset corresponds to exactly one BY group.
here is an example what the by groups look like in the dataset: customers
UK 105 (other fields)
UK 102 (other fields)
US 291 (other fields)
US 292 (other fields)
Could I get some advice on how to go about and enter the columns in arrays and then use them in a datastep. or do you suggest to do it in another way?
%let var1 = uk105;
%let var2 = uk102;
.....
%let var33 = jk12;
data want;
set customers;
by country city;
if _n_ = 1 then do;
*set datasets and create and populate arrays*;
* use array values in calculations with fields from dataset customers, depending on which by group. if the by group is uk and city is 105 then i need to use the created array corresponding to that by group;
It is a little hard to understand what you want.
It sounds like you have one dataset named CUSTOMERS that has all of the main variables, and a bunch of single-variable datasets that hold the values of NET_WORTH for a lot of different things (countries?).
Assuming that the observations in all of the datasets are in the same order then I think you are asking for how to generate a data step like this:
data want;
set customers;
set uk105 (rename=(net_worth=uk105));
set uk102 (rename=(net_worth=uk102));
....
run;
That code might be easiest to generate using a data step.
filename code temp;
data _null_;
input name $32. ;
file code ;
put ' set ' name '(rename=(net_worth=' name '));' ;
cards;
uk105
uk102
;;;;
data want;
set customers;
%include code / source2;
run;

SAS Macro - create a running macro as concatenation of another set of macros

I have a macro related issue that I’m currently struggling to develop and understand. Any pointer on resolving this would be greatly appreciated :-)
It goes something similar below:
I have a ‘n’ [variable] number of macro variables ‘Key’ which resolve to
&Key1=1 &Key2=2 &Key3=3 … &Keyn=n
I want to create an automatic running macro ‘Masterkey’ that goes something like
&Masterkey1=1 &Masterkey2=12 &Masterkey3=123 …
i.e. &MasterKeyN=123…N
How can I get this to work to create ‘&MasterkeyN’ where N is not fixed as can be variable subject to each set of cases with [1-n] keys?
Many thanks.
Nad
I think this probably isn't a useful technique, but I'll answer it anyway.
I'll also assume that &Key1..n may have values other than the number stored in them, and you want those values collected into the &MasterKey1..n variables.
What you'd need to do is use a nested loop, and to know a bit about how macro variables resolve.
%let key1=A;
%let key2=B;
%let key3=C;
%global MasterKey1 MasterKey2 MasterKey3; *so they work outside of the macro;
%macro create_master(numKeys=);
%do master=1 %to &numKeys; *Outer loop for the MasterKeys we want to make;
%let temp=;
%do keyiter = 1 %to &master; *Inner loop for the keys that fall into the MasterKey;
%let temp = &temp.&&Key&keyiter.; *&& delays macro variable resolution one time.;
%end;
%let MasterKey&master.=&temp.;
%end;
%mend create_master;
%create_master(numkeys=3);
%put &=MasterKey1 &=MasterKey2 &=MasterKey3;
The magic here is &&. Basically, during macro variable parsing, you deal with one or two &s at a time. If it helps put some %put statements inside the loop to see how it works.
To start with, let's jump in towards the end. On this iteration, &temp=AB &Keyiter=3 and &Key3=C.
0. &temp.&&Key&keyiter
1. AB&Key3
2. ABC
So from 0 to 1, the parser sees &temp., the period denoting the end of one variable, so it looks up what is that: &temp.=AB and replaces it with AB. Then it sees two &s, and replaces them with one & but doesn't attempt to resolve anything with them. Then it sees Key, no ampersands there so nothing to do. Then it sees &keyiter, okay, replace that with 3.
Then from 1 to 2, it sees AB, ignores it as it should. Then it sees &Key3 (two ampersands became one don't forget), and now it knows to resolve that to C, which it does - thus ABC.
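To watch that resolution happen for the single iteration described above, a short sketch like the following can be run in open code. The values are the ones from the walkthrough; turning on SYMBOLGEN is just one way to make the log show each step.

```sas
%let Key3    = C;
%let keyiter = 3;
%let temp    = AB;

options symbolgen;                    /* log each macro variable as it resolves */
%put Result: &temp.&&Key&keyiter;     /* per the walkthrough: AB&Key3, then ABC */
options nosymbolgen;
```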
Many thanks to everyone for the helpful comments and solutions. Yes, absolutely, there can be solutions in SAS in many ways. I was probably fixated on approaching this from one angle. Anyway, I've now been able to resolve the issue.
Here's a brief summary of the problem in question and the solution below:
I have a number of customer and transaction tables. The objective is to match/join two tables based on match type + process key and matching keys (matching keys are fields in the tables).
Instructions on matching are given by a table similar to the one at the bottom.
I am trying to build a macro variable that contains the join instructions for a given matching type and process key, e.g. for matching type=Profile and process key=3, I want to create a macro variable that contains a string (like below), which can then be fed into a proc sql command:
Table1.Name=Table2.Name
And Table1.Address=Table2.Address
I created a macro variable for each of the matching keys based on match type and process key, and wanted to have a dynamic concatenation of the keys [with ‘and’ text added before the 2nd and subsequent keys]. The problem I'm having is that there's no fixed number of matching keys for any given matching type and process key.
Matching_Type Process_Key Matching_Keys
Profile 1 Name
Profile 1 Address
Profile 1 Gender
Market 1 Name
Market 1 Income
Profile 2 Name
Profile 2 Address
Profile 2 Gender
Profile 2 DoB
Profile 2 Phone_Number
Market 2 Name
Market 2 Address
Market 2 Gender
Market 2 Income
Market 2 Transaction_Amount
Market 2 Credit_Card_Number
Profile 3 Name
Profile 3 Address
Solution:
%macro test;
proc sql noprint;
select left(put(count(distinct matching_type||left(put(process_key,8.))),8.)) into :num
from test;
select distinct matching_type, process_key into :matchkey1 - :matchkey&num, :processkey1-:processkey&num
from test;
%do i=1 %to &num;
%global &&matchkey&i&&processkey&i;
select 'table1.'||trim(matching_keys)||' = table2.'||trim(matching_keys) into :&&matchkey&i&&processkey&i separated by ' and '
from test
where matching_type="&&matchkey&i" and process_key=&&processkey&i;
%end;
quit;
%mend;
options mprint;
%test;
%put _user_;
Many thanks everyone.

Defining variables in one table based on values in another table

I'm fairly new with SAS and am looking for a little guidance.
I have two tables. One contains my data (something like the below, although much larger):
Data DataTable;
Input Var001 $ Var002 $;
Datalines;
000 050
063 052
015 017
997 035
;
run;
My variables are integers (read in as text) from 000 to 999. There can be as few as two, or as many as 500 depending on what the user is doing.
The second table contains user specified groupings of the variables in the DataTable:
Data Var_Groupings;
input var $ range $ Group_Desc $;
Datalines;
001 025 0-25
001 075 26-75
001 999 76-999
002 030 0-30
002 050 31-50
002 060 51-60
002 999 61-999
;
run;
(In actuality, this table is adjusted by the user in Excel and then imported, but this will work for the purposes of troubleshooting.)
The "var" variable in the var_groupings table corresponds to a var column in the DataTable. So for instance a "var" of 001 in the var_groupings table is saying that this grouping will be on var001 of the DataTable.
The "range" variable specifies the upper bound of a grouping. So looking at ranges in the var_groupings table where var is equal to 001, the user wants the first group to span from 0 to 25, the second group to span from 26 to 75, and the last group to span from 76 to 999.
EDIT: The Group_Desc column can contain any string and is not necessarily of the form presented here.
the final table should look something like this:
Var001 Var002 Var001_Group Var002_group
000 050 0-25 31-50
063 052 26-75 51-60
015 017 0-25 0-30
997 035 76-999 31-50
I'm not sure how I would even approach something like this. Any guidance you can give would be greatly appreciated.
That's an interesting one, thanks! It can be solved using CALL EXECUTE, since we need to create variable names from values. And obviously PROC FORMAT is the easiest way to convert some values into ranges. So, combining these two things we can do something like this:
proc sort data=Var_Groupings; by var range; run;
/*create dataset which will be the source of our formats' descriptions*/
data formatset;
set Var_Groupings;
by var;
fmtname='myformat';
type='n';
label=Group_Desc;
start=input(lag(range),8.)+1;
end=input(range,8.);
if FIRST.var then start=0;
drop range Group_Desc;
run;
/*put the raw data into new one, which we'll change to get what we want (just to avoid
changing the raw one)*/
data want;
set Datatable;
run;
/*now we iterate through all distinct variable numbers. As soon as we find a new number
we generate with CALL EXECUTE three steps: PROC FORMAT, DATA-step to apply this format
to a specific variable, and then PROC CATALOG to delete format*/
data _null_;
set formatset;
by var;
if FIRST.var then do;
call execute(cats("proc format library=work cntlin=formatset(where=(var='",var,"')); run;"));
call execute("data want;");
call execute("set want;");
call execute(cats('_Var',var,'=input(var',var,',8.);'));
call execute(cats('Var',var,'_Group=put(_Var',var,',myformat.);'));
call execute("drop _:;");
call execute("proc catalog catalog=work.formats; delete myformat.format; run;");
end;
run;
UPDATE. I've changed the first DATA-step (for creating formatset) so that now end and start for each range is taken from variable range, not Group_Desc. And PROC SORT moved to the beginning of the code.