I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I thinks what you are looking for might be more like interpolation. While this is not mean of two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /*second is column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
What it does is uses two set statements, one that gets the main (and previous) amounts, and one that sets until the next amount is found. Here I use the sequence of id variables to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the id variables aren't sequential or in an order of any sort.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume last_amount is missing, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose, I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount ) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;
Related
How can I convert my SAS data set, into a data set that I can easily paste into the forum or hand over to someone to replicate my data. Ideally, I'd also like to be able to control the amount of records that are included.
Ie I have sashelp.class in the SASHELP library, but I want to provide it here so others can use it as the starting point for my question.
To do this, you can use a macro written by Mark Jordan at SAS, the code is stored in GitHub as well.
You need to provide the data set name, including library and the number of observations you want to output. It takes them in order. The code will then appear in your SAS log.
*data set you want to create demo data for;
%let dataSetName = sashelp.Class;
*number of observations you want to keep;
%let obsKeep = 5;
******************************************************
DO NOT CHANGE ANYTHING BELOW THIS LINE
******************************************************;
%let source_path = https://gist.githubusercontent.com/statgeek/bcc55940dd825a13b9c8ca40a904cba9/raw/865d2cf18f5150b8e887218dde0fc3951d0ff15b/data2datastep.sas;
filename reprex url "&source_path";
%include reprex;
filename reprex;
option linesize=max;
%data2datastep(dsn=&dataSetName, obs=&obsKeep);
This may not work if you do not have access to the github page, in that case, you can manually navigate to the page (same link) and copy/paste it into SAS. Then run the program and run only the last step, the %data2datastep(dsn=, obs=);
This topic came up recently on SAS Communities and I created a little more robust macro than the one Reeza linked. You can see it in Github: ds2post.sas
* Pull macro definition from GITHUB ;
filename ds2post url
'https://raw.githubusercontent.com/sasutils/macros/master/ds2post.sas'
;
%include ds2post ;
For example if you wanted to share the first 5 observations of SASHELP.CARS you would run this macro call:
%ds2post(sashelp.cars,obs=5)
Which would generate this code to the SAS log:
data work.cars (label='2004 Car Data');
infile datalines dsd dlm='|' truncover;
input Make :$13. Model :$40. Type :$8. Origin :$6. DriveTrain :$5.
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway
Weight Wheelbase Length
;
format MSRP dollar8. Invoice dollar8. ;
label EngineSize='Engine Size (L)' MPG_City='MPG (City)'
MPG_Highway='MPG (Highway)' Weight='Weight (LBS)'
Wheelbase='Wheelbase (IN)' Length='Length (IN)'
;
datalines4;
Acura|MDX|SUV|Asia|All|36945|33337|3.5|6|265|17|23|4451|106|189
Acura|RSX Type S 2dr|Sedan|Asia|Front|23820|21761|2|4|200|24|31|2778|101|172
Acura|TSX 4dr|Sedan|Asia|Front|26990|24647|2.4|4|200|22|29|3230|105|183
Acura|TL 4dr|Sedan|Asia|Front|33195|30299|3.2|6|270|20|28|3575|108|186
Acura|3.5 RL 4dr|Sedan|Asia|Front|43755|39014|3.5|6|225|18|24|3880|115|197
;;;;
Try this little test to compare the two macros.
First make a sample dataset with a couple of issues.
data testit;
set sashelp.class (obs=5);
if _n_=1 then name='Le Bron';
if _n_=2 then age=.;
if _n_=3 then wt=.;
if _n_=4 then name='12;34';
run;
Then run both macros to dump code to the SAS log.
%ds2post(testit);
%data2datastep(dsn=testit,obs=20);
Copy the code from the log. Changing the name in the DATA statements to not overwrite the original dataset or each other. Run them and compare the result to the original.
proc compare data=testit compare=testit1; run;
proc compare data=testit compare=testit2; run;
Result using %DS2POST:
The COMPARE Procedure
Comparison of WORK.TESTIT with WORK.TESTIT1
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT1 02NOV18:17:10:29 02NOV18:17:10:29 6 5
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
Last Obs 5 5
Number of Observations in Common: 5.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT1: 5.
Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 5.
Summary of results using %Data2DataStep:
Comparison of WORK.TESTIT with WORK.TESTIT2
(Method=EXACT)
Data Set Summary
Dataset Created Modified NVar NObs
WORK.TESTIT 02NOV18:17:09:40 02NOV18:17:09:40 6 5
WORK.TESTIT2 02NOV18:17:10:29 02NOV18:17:10:29 6 3
Variables Summary
Number of Variables in Common: 6.
Observation Summary
Observation Base Compare
First Obs 1 1
First Unequal 1 1
Last Unequal 3 3
Last Match 3 3
Last Obs 5 .
Number of Observations in Common: 3.
Number of Observations in WORK.TESTIT but not in WORK.TESTIT2: 2.
Total Number of Observations Read from WORK.TESTIT: 5.
Total Number of Observations Read from WORK.TESTIT2: 3.
Number of Observations with Some Compared Variables Unequal: 3.
Number of Observations with All Compared Variables Equal: 0.
Variable Values Summary
Values Comparison Summary
Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 4.
Total Number of Values which Compare Unequal: 12.
Maximum Difference: 0.
Variables with Unequal Values
Variable Type Len Ndif MaxDif MissDif
Name CHAR 8 1 0
Sex CHAR 1 3 3
Age NUM 8 2 0 2
Height NUM 8 3 0 3
Weight NUM 8 3 0 3
Note that I am sure there are values that will cause trouble for my macro also. But hopefully they are caused by data that is less likely to occur than spaces or semi-colons.
I want converted my data from long to wide format using data step. The problem is that due to missing values the values are not placed in the correct cells. I think to solve the problem I have to include placeholder for missing values.
The problem is I don't know how to do. Can someone please give me tip on how to go about it.
data tic;
input id country$ month math;
datalines;
1 uk 1 10
1 uk 2 15
1 uk 3 24
2 us 2 15
2 us 4 12
3 fl 1 15
3 fl 2 16
3 fl 3 17
3 fl 4 15
;
run;
proc sort data=tic;
by id;
run;
data tot(drop=month math);
retain month1-month4 math1-math4;
array tat{4} month1-month4;
array kat{4} math1-math4;
set tic;
by id;
if first.id then do;
i=1;
do j=1 to 4;
tat{j}=.;
kat{j}=.;
end;
end;
tat(i)=month;
kat(i)=math;
if last.id then output;
i+1;
run;
Edit
I finally figured out what the problem is:
changed this lines of code
tat(i)=month;
kat(i)=math;
to:
tat(month)=month;
kat(month)=math;
and it fixed the problem.
Data transformations from tall and skinny to short and wide often mean that categorical data ends up as column names. This is a process of moving data to metadata, which can be a problem later on for dealing with BY or CLASS groups.
SAS has Proc TABULATE and Proc REPORT for creating pivoted output. Proc TRANSPOSE is also a good standard way of creating pivoted data.
I did notice that you are pivoting two columns at once. TRANSPOSE can't multi-pivot. The DATA Step approach you showed is a typical way for doing a transpose transform when the indices lie within known ranges. In your case the array declaration must be such that 'direct-addressing' via index can to handle the minimal and maximal month values that occur over all the data.
The question might be quite vague but I could not come up with a decent concise title.
I have data where there are id ,date, amountA and AmtB as my variables. The task is to pick the dates that are within 10 days of each other and then see if their amountA are within 20% and if they are then pick the one with highest amountB. I have used to this code to achieve this
id date amountA amountB
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
This is what I need
id date amountA amountB
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/25/2014 720 50
I wrote this code but the problem with this code is its not automatic and has to be done on a case to case basis.I need a way to loop it so that it automatically outputs the results.I am no pro at looping and hence am stuck.Any help is greatly appreciated
data test2;
set test1;
diff_days=abs(intck('days',first_dt,date));
if diff_days<=10 then flag=1;
else if diff_days>10 then flag=0;
run;
data test3 rem_test3;
set test2;
if flag=1 then output test3;
else output rem_test3;
run;
proc sort data=test3;
by id amountA;
run;
data all_within;
set test3;
by id amountA;
amtA_lag=lag1(amountA);
if first.id then
do;
counter=1;
flag1=1;
end;
if first.id=0 then
do;
counter+1;
diff=abs(amountA-amtA_lag);
if diff<(10/100*amountA) then flag1+1;
else flag1=0;
end;
if last.stay and flag1=counter then output all_within;
run;
If I understand the problem correctly, you want to group all records together that have (no skip of 10+ days) and (amt A w/in 20%)?
Looping isn't your problem - no explicitly coded loop is needed to do this (or at least, the way I think of it). SAS does the data step loop for you.
What you want to do is:
Identify groups. A group is the consecutive records that you want to, among them, collapse to one row. It's not perfectly clear to me how amountA has to behave here - does the whole group need to have less than a maximum difference of 10%, or a record to next record difference of < 10%, or a (current highest amtB of group) < 10% - but you can easily identify all of these rules. Use a RETAINed variable to keep track of the previous amountA, previous date, highest amountB, date associated with the highest amountB, amountA associated with highest amountB.
When you find a record that doesn't fit in the current group, output a record with the values of the previous group.
You shouldn't need two steps for this, although you can if you want to see it more easily - this may be helpful for debugging your rules. Set it so that you have a GroupNum variable, which you RETAIN, and you increment that any time you see a record that causes a new group to start.
I had trouble figuring out the rules...but here is some code that checks each record against the previous for the criteria I think you want.
Data HAVE;
input id date :mmddyy10. amountA amountB ;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
;
Proc Sort data=HAVE;
by id date;
Run;
Data WANT(drop=Prev_:);
Set HAVE;
Prev_Date=lag(date);
Prev_amounta=lag(amounta);
Prev_amountb=lag(amountb);
If not missing(prev_date);
If date-prev_date<=10 then do;
If (amounta-prev_amounta)/amounta<=.1 then;
If amountb<prev_amountb then do;
Date=prev_date;
AmountA=prev_amounta;
AmountB=prev_amountb;
end;
end;
Else delete;
Run;
Here is a method that I think should work. The basic approach is:
Find all the pairs of sufficiently close observations
Join the pairs with themselves to get all connected ids
Reduce the groups
Join to the original data and get the desired values
data have;
input
id
date :mmddyy10.
amountA
amountB;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
2 1/16/2014 1100 81
3 1/30/2014 700 50
4 2/05/2014 710 80
5 2/25/2014 720 50
;
run;
/* Count the observations */
%let dsid = %sysfunc(open(have));
%let nobs = %sysfunc(attrn(&dsid., nobs));
%let rc = %sysfunc(close(&dsid.));
/* Output any connected pairs */
data map;
array vals[3, &nobs.] _temporary_;
set have;
/* Put all the values in an array for comparison */
vals[1, _N_] = id;
vals[2, _N_] = date;
vals[3, _N_] = amountA;
/* Output all pairs of ids which form an acceptable pair */
do i = 1 to _N_;
if
abs(vals[2, i] - date) < 10 and
abs((vals[3, i] - amountA) / amountA) < 0.2
then do;
id2 = vals[1, i];
output;
end;
end;
keep id id2;
run;
proc sql;
/* Reduce the connections into groups */
create table groups as
select
a.id,
min(min(a.id, a.id2, b.id)) as group
from map as a
left join map as b
on a.id = b.id2
group by a.id;
/* Get the final output */
create table lookup (where = (amountB = maxB)) as
select
have.*,
groups.group,
max(have.amountB) as maxB
from have
left join groups
on have.id = groups.id
group by groups.group;
quit;
The code works for the example data. However, the group reduction is insufficient for more complicated data. Fortunately, approaches for finding all the subgraphs given a set of edges can be found here, here, here or here (using SAS/OR).
I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but it loses the ability to do math.
data have;
input category :$10. name1 name2;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the newname1000 endpoint using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternately, you could put out the original dataset as a three column dataset with many rows for each category (was it this way to begin with prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names).
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for ever line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. They keep two options because there where is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest below solution (would like #Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think below approach would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75