I'm fairly new to SAS and am looking for a little guidance.
I have two tables. One contains my data (something like the below, although much larger):
Data DataTable;
Input Var001 $ Var002 $;
Datalines;
000 050
063 052
015 017
997 035
;
run;
My variables are integers (read in as text) from 000 to 999. There can be as few as two, or as many as 500 depending on what the user is doing.
The second table contains user specified groupings of the variables in the DataTable:
Data Var_Groupings;
input var $ range $ Group_Desc $;
Datalines;
001 025 0-25
001 075 26-75
001 999 76-999
002 030 0-30
002 050 31-50
002 060 51-60
002 999 61-999
;
run;
(In actuality, this table is adjusted by the user in Excel and then imported, but this will work for the purposes of troubleshooting.)
The "var" variable in the var_groupings table corresponds to a var column in the DataTable. So for instance a "var" of 001 in the var_groupings table is saying that this grouping will be on var001 of the DataTable.
The "range" variable specifies the upper bound of a grouping. So looking at the ranges in the var_groupings table where var is equal to 001, the user wants the first group to span from 0 to 25, the second group to span from 26 to 75, and the last group to span from 76 to 999.
EDIT: The Group_Desc column can contain any string and is not necessarily of the form presented here.
The final table should look something like this:
Var001 Var002 Var001_Group Var002_group
000 050 0-25 31-50
063 052 26-75 51-60
015 017 0-25 0-30
997 035 76-999 31-50
I'm not sure how I would even approach something like this. Any guidance you can give would be greatly appreciated.
That's an interesting one, thanks! It can be solved using CALL EXECUTE, since we need to create variable names from values. And obviously PROC FORMAT is the easiest way to convert some values into ranges. So, combining these two things we can do something like this:
proc sort data=Var_Groupings; by var range; run;
/*create dataset which will be the source of our formats' descriptions*/
data formatset;
set Var_Groupings;
by var;
fmtname='myformat';
type='n';
label=Group_Desc;
start=input(lag(range),8.)+1;
end=input(range,8.);
if FIRST.var then start=0;
drop range Group_Desc;
run;
/*put the raw data into new one, which we'll change to get what we want (just to avoid
changing the raw one)*/
data want;
set Datatable;
run;
/*now we iterate through all distinct variable numbers. As soon as we find a new number
we generate three steps with CALL EXECUTE: PROC FORMAT, a DATA step to apply this format
to a specific variable, and then PROC CATALOG to delete the format*/
data _null_;
set formatset;
by var;
if FIRST.var then do;
call execute(cats("proc format library=work cntlin=formatset(where=(var='",var,"')); run;"));
call execute("data want;");
call execute("set want;");
call execute(cats('_Var',var,'=input(var',var,',8.);'));
call execute(cats('Var',var,'_Group=put(_Var',var,',myformat.);'));
call execute("drop _:;");
call execute("proc catalog catalog=work.formats; delete myformat.format; run;");
end;
run;
UPDATE. I've changed the first DATA step (the one creating formatset) so that end and start for each range are now taken from the variable range, not Group_Desc. The PROC SORT has also moved to the beginning of the code.
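A possible variation, offered only as a sketch: instead of creating and deleting the same format once per variable, build one format per variable and apply them all in a single generated DATA step. The names formatset2, want2 and the grpNNNf format names are made up for illustration (a format name cannot end in a digit, hence the trailing f). It assumes every VarNNN column in DataTable is character, as described in the question.

```sas
proc sort data=Var_Groupings; by var range; run;

/* One control data set holding a separate numeric format per variable */
data formatset2;
    set Var_Groupings;
    by var;
    length fmtname $32;
    fmtname = cats('grp', var, 'f');    /* e.g. grp001f (hypothetical name) */
    type = 'n';
    label = Group_Desc;
    start = input(lag(range), 8.) + 1;  /* previous upper bound + 1 */
    end = input(range, 8.);
    if first.var then start = 0;        /* first group of each variable starts at 0 */
    keep var fmtname type label start end;
run;

/* Create all formats in one pass */
proc format library=work cntlin=formatset2(drop=var);
run;

/* Generate a single DATA step with one assignment per grouped variable */
data _null_;
    set formatset2 end=last;
    by var;
    if _n_ = 1 then call execute('data want2; set DataTable;');
    if first.var then call execute(cats('Var', var, '_Group = put(input(Var', var, ', 8.), grp', var, 'f.);'));
    if last then call execute('run;');
run;
```

With the sample tables this should generate assignments for Var001_Group and Var002_Group only. Values outside every range are not mapped and come through as the plain number.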
I have 33 different datasets with one column and all share the same column name/variable name;
net_worth
I want to load the values into arrays and use them in a data step. But the array that I use should depend on the by groups in the data step (country by city). There are a total of 33 datasets and 33 groups (country by city). Each dataset corresponds to exactly one by group.
here is an example what the by groups look like in the dataset: customers
UK 105 (other fields)
UK 102 (other fields)
US 291 (other fields)
US 292 (other fields)
Could I get some advice on how to load the columns into arrays and then use them in a data step? Or do you suggest doing it another way?
%let var1 = uk105;
%let var2 = uk102;
.....
%let var33 = jk12;
data want;
set customers;
by country city;
if _n_ = 1 then do;
*set datasets and create and populate arrays*;
* use array values in calculations with fields from dataset customers, depending on which by group. if the by group is uk and city is 105 then i need to use the created array corresponding to that by group;
It is a little hard to understand what you want.
It sounds like you have one dataset named CUSTOMERS that has all of the main variables, and a bunch of single-variable datasets that hold the values of NET_WORTH for a lot of different things (countries?).
Assuming that the observations in all of the datasets are in the same order then I think you are asking for how to generate a data step like this:
data want;
set customers;
set uk105 (rename=(net_worth=uk105));
set uk102 (rename=(net_worth=uk102));
....
run;
That code might be easiest to generate using a data step:
filename code temp;
data _null_;
input name $32. ;
file code ;
put ' set ' name '(rename=(net_worth=' name '));' ;
cards;
uk105
uk102
;;;;
data want;
set customers;
%include code / source2;
run;
I have a variable full of ZIP code observations and I want to sort those ZIP codes into four regions based on the first three digits of the code.
For example, all ZIP codes that start with 350, 351, or 352 should be grouped into a region called "central." Those that start with 362, 368, 360 or 361 should be in a region called "east." Etc.
How do I get base SAS to look at only the first three digits of the ZIP code variable?
What is the best way to associate those digits with a new variable called "region?"
Here's the code I have so far:
data work.temp;
set library.dataset;
a= substr (Zip_Code,1,3);
put a;
keep Zip_Code a;
run;
proc print data=work.temp;
run;
The column a is blank in my proc print results, however.
Thanks for your help
As @Joe explains, this happens because zip_code is defined as a numeric variable. I have seen this at a client site, where a numeric ZIP code led to various data issues. You should define the ZIP code as a character variable; you can then assign regions with IF statements, a reference table, or PROC FORMAT. Below are examples of the IF statement and reference table approaches. I find the reference table method very robust.
data have;
input zip_code $;
datalines;
35099
35167
35245
36278
36899
36167
;
By if statement
data work.temp;
set have;
if substr(Zip_Code,1,3) in ('350', '351', '352') then Region = 'EAST';
else if substr(Zip_Code,1,3) in ('362', '368', '361') then Region = 'WEST';
run;
By use of reference table
data reference;
input code $ Region $;
datalines;
350 EAST
351 EAST
352 EAST
362 WEST
368 WEST
361 WEST
;
proc sql;
select a.*, b.region from have a
left join
reference b
on substr (Zip_Code,1,3) = code;
quit;
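The PROC FORMAT option is mentioned above but not shown, so here is a minimal sketch of it. The format name $region, the data set names zips and zips_fmt, and the UNKNOWN fallback label are my own choices for illustration; the prefixes and region labels are the ones used in the examples above.

```sas
/* Small sample, ZIP code stored as character */
data zips;
    input zip_code $;
    datalines;
35099
36278
99999
;
run;

/* Map three-character ZIP prefixes to a region label */
proc format;
    value $region
        '350', '351', '352' = 'EAST'
        '362', '368', '361' = 'WEST'
        other               = 'UNKNOWN';
run;

data zips_fmt;
    set zips;
    length Region $7;
    /* Apply the format to the first three characters of the ZIP code */
    Region = put(substr(zip_code, 1, 3), $region.);
run;
```

The lookup lives in one place, so adding a prefix means editing the VALUE statement rather than a chain of IF statements.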
If a is blank, then your zip_code variable is almost certainly numeric. You probably have a note about numeric to character conversion.
SAS will happily allow you to ignore numeric and character in most instances, but it won't always give correct behavior. In this case, it's probably converting it with the BEST12 format, meaning 60601 becomes "       60601" (seven leading blanks). So substr(that,1,3) gives "   " (three blanks), of course.
Zip code ideally would be stored in a character variable as it's an identifier, but if it's not for whatever reason, you can do this:
a = substr(put(zip_code,z5.),1,3);
The Zw.d format is correct since you want Massachusetts to be "02101" and not "2101 ".
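As a small illustration of that conversion (the data set names zips_num and zips_prefix and the variable prefix are made up for this sketch, and the ZIP code is assumed to be stored as a number):

```sas
/* Hypothetical sample with a numeric ZIP code, including one with a leading zero */
data zips_num;
    input zip_code;
    datalines;
35099
2101
36278
;
run;

data zips_prefix;
    set zips_num;
    length prefix $3;
    /* Z5. pads with leading zeros, so 2101 becomes "02101" before SUBSTR */
    prefix = substr(put(zip_code, z5.), 1, 3);
run;

proc print data=zips_prefix;
run;
```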
I have a dataset which has three variables: application number, decline code and sequence. There may be multiple decline codes for a single application (each with a different sequence number). So the data looks like the following:
Application No Decline Code Sequence
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
.
.
.
.
And so on
So, I have to put this in wide format such that the first column contains unique application numbers and corresponding to each of them is their decline code(I don't need sequence number, just that decline codes should appear in order of their sequence number from left to right, separated by a comma). Something like below
Application Number Decline Code
1234 FG, AF, FK
1256 FK, AF
..........
.........
And so on
I tried running PROC TRANSPOSE by application number in SAS. The problem is that it creates multiple columns with all the decline codes listed, and if a certain decline code doesn't apply to an application, it shows . in that column. So there are many missing values and it isn't quite the format I am expecting. Is there any way to do this in SAS or SQL?
PROC TRANSPOSE can certainly help here; then you can CATX the variables together if you really just want one variable:
data have;
input ApplicationNo DeclineCode $ Sequence ;
datalines;
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
;;;;
run;
proc sort data=have;
by ApplicationNo Sequence;
run;
proc transpose data=have out=want_pre;
by ApplicationNo;
var DeclineCode;
run;
data want;
set want_pre;
length decline_codes $1024;
decline_codes = catx(', ',of col:);
keep ApplicationNo decline_codes;
run;
You could also do this trivially in one datastep, using first and last checks.
data want_ds;
set have;
by ApplicationNo Sequence;
retain decline_codes;
length decline_codes $1024; *or whatever you need;
if first.ApplicationNo then call missing(decline_codes);
decline_codes = catx(', ',decline_codes, DeclineCode);
if last.ApplicationNo then output;
run;
I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in Excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$12. name1 name2; *12 so that #ofproviders and AmountBilled are not truncated;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the newname endpoint (arbitrarily 10 in the code above) using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternatively, you could put out the original dataset as a three-column dataset with many rows for each category (was it this way to begin with, prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names);
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
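A rough sketch of that PROC REPORT idea, assuming the values stay numeric and only the report cells need the dollar format. The data set name claims is hypothetical and holds a cut-down copy of the sample data; CALL DEFINE with the FORMAT attribute changes the format of the current cell only.

```sas
/* Cut-down copy of the sample data */
data claims;
    input category :$12. name1 name2;
    datalines;
#ofpeople 20 30
AmountBilled 50 100
AmountPaid 11 35
;
run;

proc report data=claims;
    column category name1 name2;
    define category / display;
    define name1 / display;
    define name2 / display;
    /* category is to the left of name1 and name2, so it is available here */
    compute name1;
        if substr(category, 1, 6) = 'Amount' then
            call define(_col_, 'format', 'dollar10.2');
    endcomp;
    compute name2;
        if substr(category, 1, 6) = 'Amount' then
            call define(_col_, 'format', 'dollar10.2');
    endcomp;
run;
```

With an unknown number of NameN columns, the DEFINE and COMPUTE blocks would have to be generated (for example with a macro loop), which this sketch does not attempt.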
I have a situation where, for each unique value of casenum, I would like to run various queries and arithmetic operations between various observations of 'code' for that 'casenum' (see below). For example, for casenum 1234567 I would like to subtract data for code 0200 - code 0234, or 531 - 53. Please keep in mind that there are thousands of observations in this dataset. Is there an easy way to do this or to do row comparisons with the particular.
Please note casenum and code are character variables and data is a numeric variable
Here is an example of how the dataset is structured:
casenum code data
1234567 0123 4597
1234567 0234 53
1234567 0100 789
1234567 0200 531
1234567 0300 354
1111112 0123 79
1111112 0234 78
1111112 0100 77
1111112 0200 7954
1111112 0300 35
Here is the logic although likely syntactically incorrect of what I am trying to do.
For code observations where casenum is the same, within those casenums
I would like it to determine, if data for code 0234 + data for code 0100 - data for code 0123 ne data for code 0200 then newvariable = 'YES'
In other words I'd like it to test if 53 + 789 - 4597 ne 531. after that and other similar kinds of tests run within casenum 1234567, I'd like it to move onto the next casenum, and run those same tests for that casenum.
Keep in mind this dataset has hundreds of thousands of observations in it.
I'm unclear on what your logic is for the subtraction part of the code, but for the selection of a group of rows I can suggest. At first glance I would obtain a list of distinct values for casenum.
proc sql;
select distinct casenum
into :casenum_list separated by ' '
from dataset;
quit;
Now that you have a list of all distinct casenum values, I would iterate through the rows following whatever logic you need.
Possibly using another proc sql like:
%MACRO DOIT;
%LET COUNT=1;
%DO %UNTIL (%SCAN(&casenum_list,&COUNT) EQ);
%LET CASENUM_VAR=%SCAN(&casenum_list,&COUNT);
PROC SQL;
SELECT
<INSERT SOME SQL LOGIC HERE>
FROM
DATASET
WHERE CASENUM="&CASENUM_VAR"; /* casenum is character, so quote the value */
QUIT;
%LET COUNT=%EVAL(&COUNT+1);
%END;
%MEND DOIT;
%DOIT;
I hope this helps. If you can provide more insight into what you are trying to accomplish within the rows, I can be more specific.
If the formula is fixed (as your example seems to suggest), then there shouldn't be any reason that you can't do a straightforward transpose and then declare the test explicitly.
/* Transpose the data by casenum */
proc transpose data=so846572 out=transpose_ds;
id code;
var data;
by casenum;
run;
/* Now just explicitly write your conditional expression */
data StackOverflow;
set transpose_ds;
if _0234 + _0100 - _0123 ne _0200 then newvariable="yes"; /* use NE: in a DATA step expression <> is the MAX operator */
run;
Where so846572 = Your original dataset, transpose_ds = Transposed version, StackOverflow = final output.
Let us know if this expression needs to be dynamic for some reason. This should easily scale to the volume of data you've mentioned without any problems. You could conceivably do the same kind of thing with a hash as well in one pass of the data.
I don't think I really have enough info from your question to help, but I will just throw this out....
If you want to do row comparisons, you can also use the data step. Assuming your data is sorted by casenum, you can use first. and last. to determine when you have a new casenum and when you are on the last row of a casenum. This lets you sum up data values between rows or make decisions based on a previous row for a casenum listed multiple times.
Data work.temp ;
retain casenum_data ;
set lib.data ;
by casenum ;
if first.casenum then do ;
/* <reset hold vars> */
casenum_data = 0 ;
end ;
if code = "0200" or code = "0234" then .....
if last.casenum then do ;
/* output casenum summary */
output ;
end ;
run ;
Post more info about need and more help can be given.
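To make the first./last. idea concrete for the one test spelled out in the question, here is a sketch. The data set names cases and flagged and the holding variables d0123, d0234, d0100 and d0200 are made up for illustration, and only the formula from the question is tested.

```sas
/* Sample rows from the question */
data cases;
    input casenum $ code $ data;
    datalines;
1234567 0123 4597
1234567 0234 53
1234567 0100 789
1234567 0200 531
1234567 0300 354
1111112 0123 79
1111112 0234 78
1111112 0100 77
1111112 0200 7954
1111112 0300 35
;
run;

/* BY-group processing needs the data sorted by casenum */
proc sort data=cases;
    by casenum;
run;

data flagged;
    set cases;
    by casenum;
    length newvariable $3;
    retain d0123 d0234 d0100 d0200;
    /* Reset the holding variables at the start of each casenum */
    if first.casenum then call missing(d0123, d0234, d0100, d0200);
    /* Remember the data value for each code of interest */
    select (code);
        when ('0123') d0123 = data;
        when ('0234') d0234 = data;
        when ('0100') d0100 = data;
        when ('0200') d0200 = data;
        otherwise;
    end;
    /* On the last row of the casenum, run the test and write one summary row */
    if last.casenum then do;
        if d0234 + d0100 - d0123 ne d0200 then newvariable = 'YES';
        output;
    end;
    keep casenum newvariable;
run;
```

This reads the data once after the sort. If a code is missing for a casenum, the corresponding holding variable stays missing and the arithmetic produces a missing value, so that case may need its own handling.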