I have a dataset with three variables: application number, decline code, and sequence. A single application may have multiple decline codes, each with a different sequence number. The data looks like this:
Application No Decline Code Sequence
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
... and so on
I have to put this in wide format, such that the first column contains the unique application numbers and the second contains the decline codes for each of them. I don't need the sequence number; the decline codes should just appear in order of their sequence number from left to right, separated by commas. Something like this:
Application Number Decline Code
1234 FG, AF, FK
1256 FK, AF
... and so on
I tried running proc transpose by application number in SAS. The problem is that it creates multiple columns with all the decline codes listed, and if a certain decline code doesn't apply to an application, it shows . in that column. So there are many missing values, and it isn't quite the format I am expecting. Is there any way to do this in SAS or SQL?
PROC TRANSPOSE can certainly help here; then you can CATX the variables together if you really just want one variable:
data have;
input ApplicationNo DeclineCode $ Sequence ;
datalines;
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
;;;;
run;
proc sort data=have;
by ApplicationNo Sequence;
run;
proc transpose data=have out=want_pre;
by ApplicationNo;
var DeclineCode;
run;
data want;
set want_pre;
length decline_codes $1024;
decline_codes = catx(', ',of col:);
keep ApplicationNo decline_codes;
run;
You could also do this in a single data step, using first. and last. checks. The data must be sorted by ApplicationNo and Sequence, as the PROC SORT above already does.
data want_ds;
set have;
by ApplicationNo Sequence;
retain decline_codes;
length decline_codes $1024; *or whatever you need;
if first.ApplicationNo then call missing(decline_codes);
decline_codes = catx(', ',decline_codes, DeclineCode);
if last.ApplicationNo then output;
run;
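A variation on the single-step approach is a DoW loop. The sketch below assumes the same layout as the have dataset above; the names have_sorted and want_dow are made up for this illustration. Because decline_codes is not read from the input dataset, it is reset to missing at the start of each pass of the DATA step, so neither RETAIN nor CALL MISSING is needed.

```sas
/* Sample data, same layout as the question */
data have;
  input ApplicationNo DeclineCode $ Sequence;
  datalines;
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
;
run;

/* Order the codes within each application by sequence number */
proc sort data=have out=have_sorted;
  by ApplicationNo Sequence;
run;

/* DoW loop: read one whole application per DATA step iteration */
data want_dow;
  length decline_codes $1024;
  do until (last.ApplicationNo);
    set have_sorted;
    by ApplicationNo;
    decline_codes = catx(', ', decline_codes, DeclineCode);
  end;
  /* the implicit OUTPUT writes one row per application */
  keep ApplicationNo decline_codes;
run;
```

With the sample rows this should give FG, AF, FK for 1234 and FK, AF for 1256.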
I want to use SAS, e.g. proc report, to produce a custom table within my workflow.
Why: Previously, I used proc export (dbms=excel), did some very basic stats by hand, and copy-pasted them into an Excel sheet to complete the report. Recently, I've started using ODS Excel to print all the relevant data to Excel sheets, but since ODS Excel always overwrites the whole workbook (and hence also the handcrafted stats), I now want to streamline the process.
The task itself is actually very straightforward. We have some information about IDs, age, and registration, so something like this:
data test;
input ID $ AGE CENTER $;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
The goal is to produce a table report which should look like this, structure-wise:
              ID   NO-ID   Total
Count          3       2       5
Age (mean)    27    45.5    34.4
Count by Center:
A              2       1       3
B              0       1       1
C              1       0       1
It seems that proc report only takes variables as columns, not subsets of a data set (ID NE .; ID = ''). Of course, I could produce three reports from three subsetted data sets and print them separately, but I hope there is a way to put everything in one table.
Is proc report the right tool for this and if so how should I proceed? Or is it better to use proc tabulate or proc template or...?
I found a way to get a near match to what I wanted. First of all, I had to introduce a new variable vID (valid ID: 0 = not valid, 1 = valid) in the data set, like so:
data test;
input ID $ AGE CENTER $;
if ID = '' then vID = 0;
else vID = 1;
datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;
After this, I was able to use proc tabulate, as suggested by @Reeza in the comments, to build a table which closely resembles what I initially aimed for:
proc tabulate data = test;
class vID Center;
var age;
keylabel N = 'Count';
table N age*mean Center*N, vID ALL;
run;
Still, I wonder if there is a way to do this without introducing the new variable at all, just using the SAS counters for missing and non-missing observations.
UPDATE:
@Reeza suggested using proc format to assign a label to missing and non-missing ID values. In combination with the missing option of proc tabulate (which keeps observations with missing class values), this delivers the output without introducing a new variable:
proc format;
value $id_fmt
' ' = 'No-ID'
other = 'ID'
;
run;
proc tabulate data = test missing;
format ID $id_fmt.;
class ID Center;
var age;
keylabel N = 'Count';
table N age*(mean median) Center*N, (ID=' ') ALL;
run;
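As a cross-check on the counts and means in such a table, the same split can be sketched in PROC SQL with the MISSING function, again without storing a flag variable. The table name id_summary and the column names id_status, n, and mean_age are made up for this illustration.

```sas
/* Same sample data as above */
data test;
  input ID $ AGE CENTER $;
  datalines;
111 23 A
. 27 B
311 40 C
131 18 A
. 64 A
;
run;

/* Split on missing(ID) on the fly instead of storing a flag */
proc sql;
  create table id_summary as
  select case when missing(ID) then 'No-ID' else 'ID' end as id_status,
         count(*) as n,
         mean(AGE) as mean_age
  from test
  group by calculated id_status;
quit;

proc print data=id_summary noobs;
run;
```

For the five sample rows this should report 3 observations with a mean age of 27 for ID, and 2 observations with a mean age of 45.5 for No-ID.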
Original output:
            Count
            AAB   BB
01NOV2014     5    4
02NOV2014     4    3
But the ideal output is:
            Count
            BB   AAB
01NOV2014    4     5
02NOV2014    4     4
Is there a way to make proc tabulate list the columns of an n-by-k table in a requested order?
Since k is not small, I'm looking for an efficient way to achieve this. Maybe store the requested order in a macro variable?
The easiest answer depends on how the order is derived.
First, you have some ordering options on the class variable, such as order=data, which may give you the desired result if the data is stored in that order. This can be tricky, but it is sometimes a simple way to get that result.
Second, you have a couple of options related to formats.
One: if the data can be stored as a formatted numeric, where BB=1, AAB=2, etc., then use order=unformatted to sort by the underlying numbers.
Two: create a format that lists the values in the desired order, just formatting them to themselves, with notsorted in the options of the value statement, and then use order=data together with preloadfmt on the class statement.
Example of the second format option:
data have;
input var $ count;
datalines;
AAA 1
AAB 2
BBA 3
BBB 4
;;;;
run;
proc format;
value $myformatf (notsorted)
'BBB'='BBB'
'AAB'='AAB'
'BBA'='BBA'
'AAA'='AAA'
other=' ';
run;
proc tabulate data=have;
class var/order=data preloadfmt;
format var $myformatf.;
var count;
tables var,count*sum;
run;
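The first format option (a formatted numeric with order=unformatted) can be sketched in the same way. The format name grpf, the dataset have_num, and the variable names are made up for this illustration; the counts are taken from the original output in the question, with BB stored as 1 and AAB as 2.

```sas
/* Numeric codes carry the desired order; the format supplies the labels */
proc format;
  value grpf
    1 = 'BB'
    2 = 'AAB';
run;

data have_num;
  input date :date9. grp count;
  format date date9. grp grpf.;
  datalines;
01NOV2014 2 5
01NOV2014 1 4
02NOV2014 2 4
02NOV2014 1 3
;
run;

/* order=unformatted sorts the columns by the stored numbers (1, 2),
   so BB is listed before AAB even though AAB sorts first alphabetically */
proc tabulate data=have_num;
  class date;
  class grp / order=unformatted;
  var count;
  table date, grp*count*sum;
run;
```

The trade-off is that the order lives in the data rather than in the format, so changing it means recoding the variable.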
I'm fairly new with SAS and am looking for a little guidance.
I have two tables. One contains my data (something like the below, although much larger):
Data DataTable;
Input Var001 $ Var002;
Datalines;
000 050
063 052
015 017
997 035
;
run;
My variables are integers (read in as text) from 000 to 999. There can be as few as two, or as many as 500 depending on what the user is doing.
The second table contains user specified groupings of the variables in the DataTable:
Data Var_Groupings;
input var $ range $ Group_Desc $;
Datalines;
001 025 0-25
001 075 26-75
001 999 76-999
002 030 0-30
002 050 31-50
002 060 51-60
002 999 61-999
;
run;
(In actuality, this table is adjusted by the user in Excel and then imported, but this will work for the purposes of troubleshooting.)
The "var" variable in the var_groupings table corresponds to a var column in the DataTable. So for instance a "var" of 001 in the var_groupings table is saying that this grouping will be on var001 of the DataTable.
The "Range" variable specifies the upper bound of a grouping. So looking at the ranges in the var_groupings table where var is equal to 001, the user wants the first group to span from 0 to 25, the second group to span from 26 to 75, and the last group to span from 76 to 999.
EDIT: The Group_Desc column can contain any string and is not necessarily of the form presented here.
The final table should look something like this:
Var001 Var002 Var001_Group Var002_group
000 050 0-25 31-50
063 052 26-75 51-60
015 017 0-25 0-30
997 035 76-999 31-50
I'm not sure how I would even approach something like this. Any guidance you can give would be greatly appreciated.
That's an interesting one, thanks! It can be solved using CALL EXECUTE, since we need to create variable names from values. And obviously PROC FORMAT is the easiest way to convert some values into ranges. So, combining these two things we can do something like this:
proc sort data=Var_Groupings; by var range; run;
/*create dataset which will be the source of our formats' descriptions*/
data formatset;
set Var_Groupings;
by var;
fmtname='myformat';
type='n';
label=Group_Desc;
start=input(lag(range),8.)+1;
end=input(range,8.);
if FIRST.var then start=0;
drop range Group_Desc;
run;
/*put the raw data into new one, which we'll change to get what we want (just to avoid
changing the raw one)*/
data want;
set Datatable;
run;
/*now we iterate through all distinct variable numbers. As soon as we find a new number
we generate three steps with CALL EXECUTE: PROC FORMAT, a DATA step to apply this format
to a specific variable, and then PROC CATALOG to delete the format*/
data _null_;
set formatset;
by var;
if FIRST.var then do;
call execute(cats("proc format library=work cntlin=formatset(where=(var='",var,"')); run;"));
call execute("data want;");
call execute("set want;");
call execute(cats('_Var',var,'=input(var',var,',8.);'));
call execute(cats('Var',var,'_Group=put(_Var',var,',myformat.);'));
call execute("drop _:;");
call execute("proc catalog catalog=work.formats; delete myformat.format; run; quit;");
end;
run;
UPDATE. I've changed the first DATA step (the one creating formatset) so that end and start for each range are now taken from the variable range, not from Group_Desc. I've also moved PROC SORT to the beginning of the code.
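To see what the generated format amounts to, here is a hand-written sketch of the format that the CNTLIN dataset describes for var 001, applied to the Var001 values from the question. The names grp001f and demo are made up for this illustration; the real code builds the ranges from Var_Groupings instead of hard-coding them.

```sas
/* Hand-written equivalent of the CNTLIN-driven format for var 001 */
proc format;
  value grp001f
    0 - 25   = '0-25'
    26 - 75  = '26-75'
    76 - 999 = '76-999';
run;

data demo;
  input Var001 $;
  /* convert the text value to a number, then map it to its group label */
  Var001_Group = put(input(Var001, 8.), grp001f.);
  datalines;
000
063
015
997
;
run;
```

This should yield 0-25, 26-75, 0-25, and 76-999, matching the Var001_Group column of the desired table.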
I'm new to SAS and would greatly appreciate help writing some code. How can I format a changing set of columns based on the values in the first column?
So basically here's the original data:
Category Name1 Name2 ... (the number of Name columns varies)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to take the values under Name1 through however many Name# columns there are, and reformat them to dollar10.2 in the rows where Category is 'AmountBilled', 'AmountPaid', or 'AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in Excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but then you lose the ability to do math on it.
data have;
input category :$12. name1 name2; *$12 so that #ofproviders and AmountBilled are not truncated;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically determine the endpoint of the newname array (10 here) using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternatively, you could write out the original dataset as a three-column dataset with many rows for each category (was it this way to begin with, prior to a PROC TRANSPOSE?) and put the character variable there, which needs no second array. To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; /* now we have (namenum) rows per line. Test for a missing names[namenum] before the output if you want only nonmissing rows (if not every row has the same number of names). */
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
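A minimal sketch of that PROC REPORT route, assuming just the two name columns from the example (with more columns, each one would need its own compute block, or the blocks would have to be generated). CALL DEFINE with the FORMAT attribute changes the display format of the current cell only, so the variables stay numeric.

```sas
data have;
  input category :$12. name1 name2;
  datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;
run;

proc report data=have;
  column category name1 name2;
  define category / display;
  define name1 / display;
  define name2 / display;
  /* category is to the left of name1 and name2, so it is available here */
  compute name1;
    if substr(category, 1, 6) = 'Amount' then
      call define(_col_, 'format', 'dollar10.2');
  endcomp;
  compute name2;
    if substr(category, 1, 6) = 'Amount' then
      call define(_col_, 'format', 'dollar10.2');
  endcomp;
run;
```

This only affects the report output; the underlying dataset is unchanged.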
I have a situation where, for each unique value of casenum, I would like to run various queries and arithmetic operations between the observations of 'code' for that 'casenum' (see below). For example, for casenum 1234567 I would like to subtract the data for code 0234 from the data for code 0200, i.e. 531 - 53. Please keep in mind that there are thousands of observations in this dataset. Is there an easy way to do this or to do row comparisons with the particular.
Please note that casenum and code are character variables and data is a numeric variable.
Here is an example of how the dataset is structured:
casenum code data
1234567 0123 4597
1234567 0234 53
1234567 0100 789
1234567 0200 531
1234567 0300 354
1111112 0123 79
1111112 0234 78
1111112 0100 77
1111112 0200 7954
1111112 0300 35
Here is the logic, although likely syntactically incorrect, of what I am trying to do.
For code observations where casenum is the same, within those casenums
I would like it to determine: if data for code 0234 + data for code 0100 - data for code 0123 ne data for code 0200 then newvariable = 'YES'
In other words, I'd like it to test whether 53 + 789 - 4597 ne 531. After that and other similar tests have run within casenum 1234567, I'd like it to move on to the next casenum and run those same tests for that casenum.
Keep in mind this dataset has hundreds of thousands of observations in it.
I'm unclear on what your logic is for the subtraction part of the code, but I can suggest an approach for selecting a group of rows. First, I would obtain a list of distinct values for casenum.
proc sql;
select distinct casenum
into :casenum_list separated by ' '
from dataset;
quit;
Now that you have a list of all distinct casenum values, I would iterate through the rows following whatever logic you need.
Possibly using another proc sql like:
%MACRO DOIT;
%LET COUNT=1;
%DO %UNTIL (%SCAN(&casenum_list,&COUNT) EQ);
%LET CASENUM_VAR=%SCAN(&casenum_list,&COUNT);
PROC SQL;
SELECT
<INSERT SOME SQL LOGIC HERE>
FROM
DATASET
WHERE CASENUM="&CASENUM_VAR"; /* casenum is character, so quote the value */
QUIT;
%LET COUNT=%EVAL(&COUNT+1);
%END;
%MEND DOIT;
%DOIT;
I hope this helps. If you can provide more insight into what you are trying to accomplish within the rows, I can be more specific.
If the formula is fixed (as your example seems to suggest), then there shouldn't be any reason that you can't do a straightforward transpose and then declare the test explicitly.
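/* Sort first: BY-group processing in PROC TRANSPOSE needs sorted input */
proc sort data=so846572;
by casenum;
run;
/* Transpose the data by casenum */
proc transpose data=so846572 out=transpose_ds;
by casenum;
id code;
var data;
run;
/* Now just explicitly write your conditional expression */
/* (use NE here: in a DATA step, <> is the MAX operator, not "not equal") */
data StackOverflow;
set transpose_ds;
if _0234 + _0100 - _0123 ne _0200 then newvariable="yes";
run;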
Where so846572 = Your original dataset, transpose_ds = Transposed version, StackOverflow = final output.
Let us know if this expression needs to be dynamic for some reason. This should easily scale to the volume of data you've mentioned without any problems. You could conceivably do the same kind of thing with a hash as well in one pass of the data.
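For comparison, the same fixed-formula test can be sketched without a transpose, using conditional aggregation in PROC SQL. The names casedata, flagged, lhs, and rhs are made up for this illustration, and a code that is absent for a casenum is treated as 0 here, which may or may not be what you want.

```sas
data casedata;
  input casenum $ code $ data;
  datalines;
1234567 0123 4597
1234567 0234 53
1234567 0100 789
1234567 0200 531
1234567 0300 354
1111112 0123 79
1111112 0234 78
1111112 0100 77
1111112 0200 7954
1111112 0300 35
;
run;

/* One output row per casenum; each SUM picks out the value for one code */
proc sql;
  create table flagged as
  select casenum,
         sum(case when code = '0234' then data else 0 end)
       + sum(case when code = '0100' then data else 0 end)
       - sum(case when code = '0123' then data else 0 end) as lhs,
         sum(case when code = '0200' then data else 0 end) as rhs,
         case when calculated lhs ne calculated rhs
              then 'YES' else 'NO' end as newvariable
  from casedata
  group by casenum;
quit;
```

With the sample rows, lhs is -3755 against an rhs of 531 for casenum 1234567, and 76 against 7954 for casenum 1111112, so both are flagged YES.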
I don't think I really have enough info from your question to help, but I will just throw this out there.
If you want to do row comparisons, you can also use the data step. Assuming your data is sorted by casenum, you can use first. and last. to detect when a new casenum starts and when you are on its last row. This is useful if you want to sum data values across rows, or make decisions based on a previous row, for a casenum that is listed multiple times.
Data work.temp ;
retain casenum_data ;
set lib.data ;
by casenum ;
if first.casenum then do ;
/* <reset hold vars> */
casenum_data = 0 ;
end ;
if code = "0200" or code = "0234" then do ;
/* <row logic for these codes> */
end ;
if last.casenum then do ;
/* output casenum summary */
output ;
end ;
run ;
If you post more information about what you need, more help can be given.