I have the following dataset:
AGE HSQ PCT
65 1 0.7
65 2 0.2
65 3 0.1
66 1 0.5
66 2 0.25
66 3 0.25
[...]
What I need is to get the following output:
AGE P1 P2 P3
65 0.7 0.2 0.1
66 0.5 0.25 0.25
[...]
I have been told to use LAG and FIRST.AGE or LAST.AGE to do that, and it seems a good strategy to me. However, I am not able to get the final result. The (wrong) code I am using is:
DATA OUTPUT;
SET SAMPLE;
BY AGE HSQ;
IF LAST.AGE THEN DO;
P1=LAG2(PCT);
P2=LAG1(PCT);
P3=PCT;
END;
RUN;
But it picks up the previous age's percentages, which is not what I need. Where is the syntax error? Thanks!
Were you told to use them because this is an assignment, or because someone suggested it as the easiest way to do it?
The easiest way to do this is PROC TRANSPOSE:
data have;
input AGE HSQ PCT;
datalines;
65 1 0.7
65 2 0.2
65 3 0.1
66 1 0.5
66 2 0.25
66 3 0.25
;;;;
run;
proc transpose data=have out=want prefix=P;
by age;
var pct;
id hsq;
run;
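LAG does not work the way you think it works. It does not give you the value of the previous row. Instead, each LAG call keeps its own queue: every time the call executes, it pushes the current value of its argument onto the queue and returns a value pushed on an earlier execution (the last one for LAG1, two executions back for LAG2). So you can't use it inside an IF block like that, because the queue is only updated on the rows where the block actually runs.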
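A minimal sketch of that queue behaviour, using a made-up four-row dataset (lag_demo, x, lag_always and lag_even are illustrative names, not from the question):

```sas
data lag_demo;
  input x;
  /* Executed on every row, so this queue always holds the previous row's x */
  lag_always = lag(x);
  /* Executed only when x is even, so this queue only ever sees even values */
  if mod(x, 2) = 0 then lag_even = lag(x);
  datalines;
1
2
3
4
;
run;

proc print data=lag_demo noobs;
run;
```

On the row where x=4, lag_always is 3 (the previous row), while lag_even is 2, the value stored the last time that particular LAG call executed.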
If you for some reason had to do this in a datastep, then you would want to do it like this:
data want;
  array p[3];
  do _n_ = 1 by 1 until (last.age);
    set have;
    by age;
    p[hsq]=pct;
  end;
  keep p1-p3 age;
run;
Really no reason to use lag, or any concept of lag; just as you come across values that belong in a place, you assign them to that place, and when you hit last.age then output.
Anybody want to join me in putting in a SASware request to remove the LAG function?
Just for fun, the direct answer to the original question (to show how this could be done):
DATA want;
SET have;
BY AGE HSQ;
p1=lag2(pct);
p2=lag1(pct);
p3=pct;
if last.age then output;
run;
This does a lot of extra work (by a lot I mean a few nanoseconds of CPU time, of course) because it calculates the lags six times and only outputs two of the results. It is also a bit 'risky' because it doesn't check that HSQ has the expected value: if an age were missing one entry and had only 2 rows, P1 would pick up the previous age's HSQ=3 value, which is probably not desired.
The ultimate point is that with LAG, if you do intend to use it as a stand-in for "previous row's record", you need to keep it outside of conditional blocks. Calculate the lag for every row, and use the result conditionally (in this case, output is used conditionally).
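One possible way to guard against the short-group risk described above is sketched here. It assumes every complete age has exactly three rows in HSQ order; age_2_back is a hypothetical helper variable, and the sketch does not check HSQ itself.

```sas
data want;
  set have;
  by age;
  /* All LAG calls run on every row, so each queue tracks the true previous rows */
  p1 = lag2(pct);
  p2 = lag1(pct);
  p3 = pct;
  age_2_back = lag2(age);   /* age of the row two positions back */
  /* Output only if the row two back belongs to the same age,
     i.e. the group has at least three rows */
  if last.age and age_2_back = age then output;
  drop age_2_back hsq pct;
run;
```

An age with only two rows is simply dropped here rather than borrowing a value from the previous age; whether dropping is the right treatment depends on the data.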
I want to store an instance of a data step variable in a macro-variable using call symput, then use that macro-variable in the same data step to populate a new field, assigning it a new value every 36 records.
I tried the following code:
data a;
set a;
if MOB = 1 then do;
MOB1_accounts = accounts;
call symput('MOB1_acct', MOB1_accounts);
end;
else if MOB > 1 then MOB1_accounts = &MOB1_acct.;
run;
I have a series of repeating MOB's (1-36). I want to create a field called MOB1_Accts, set it equal to the # of accounts for that cohort where MOB = 1, and keep that value when MOB = 2, 3, 4 etc. I basically want to "drag down" the MOB 1 value every 36 records.
For some reason this macro-variable is returning "1" instead of the correct # accounts. I think it might be a char/numeric issue but unsure. I've tried every possible permutation of single quotes, double quotes, symget, etc... no luck.
Thanks for the help!
You are misusing the macro system.
The ampersand (&) introducer in source code tells SAS to resolve the following symbol and place its value into the code submission stream. Thus, the resolved &MOB1_acct. cannot be changed by the running DATA step. In other words, a running step cannot change its own source code: the resolved macro variable will be the same for all implicit iterations of the step because its value became part of the source code of the step.
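A small sketch of the timing difference, assuming a leftover global macro variable with the value 1 (the %let line and both variable names are illustrative, not taken from the question):

```sas
%let mob1_acct = 1;   /* hypothetical value left over from earlier in the session */

data _null_;
  call symputx('mob1_acct', 999);
  /* &mob1_acct. is resolved while the step is compiled,
     so this line is compiled as: resolved_at_compile = 1; */
  resolved_at_compile = &mob1_acct.;
  /* SYMGETN reads the macro symbol table while the step executes */
  read_at_run_time = symgetn('mob1_acct');
  put resolved_at_compile= read_at_run_time=;
run;
```

The log line should show resolved_at_compile=1 and read_at_run_time=999, because only the SYMGETN call sees the value written by CALL SYMPUTX in the same step.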
You can use SYMPUT() and SYMGET() functions to move strings out of and into a DATA Step. But that is still the wrong approach for your problem.
The most straightforward technique combines two things:
a retained variable, and
a mod(_n_, 36) computation to detect the first row of every block of 36 rows. (_n_ is a proxy for the row number in a simple step with a single SET.)
Example:
data a;
set a;
retain mob1_accounts;
* every 36 rows change the value, otherwise the value is retained;
if mod(_n_,36) = 1 then mob1_accounts = accounts;
run;
You didn't show any data, so the actual program statements you need might be slightly different.
Contrasting SYMPUT/SYMGET with RETAIN
As stated, SYMPUT/SYMGET is a possible way to retain values by storing them in the macro symbol table. There is a penalty, though. The SYM* routines require a function call, whatever black-box work goes on internally to store and retrieve a symbol value, and possibly additional conversions between character and numeric.
Example:
1,000,000 rows read. DATA _null_ steps to avoid writing overhead as part of contrast.
data have;
do rownum = 1 to 1e6;
mob + 1;
accounts = sum(accounts, rand('integer', 1,50) - 10);
if mob > 36 then mob = 1;
output;
end;
run;
data _null_;
set have;
if mob = 1 then call symput ('mob1_accounts', cats(accounts));
mob1_accounts = symgetn('mob1_accounts');
run;
data _null_;
set have;
retain mob1_accounts;
if mob = 1 then mob1_accounts = accounts;
run;
On my system the log shows:
142 data _null_;
143 set have;
144
145 if mob = 1 then call symput ('mob1_accounts', cats(accounts));
146
147 mob1_accounts = symgetn('mob1_accounts');
148 run;
NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.34 seconds
cpu time 0.34 seconds
149
150 data _null_;
151 set have;
152 retain mob1_accounts;
153
154 if mob = 1 then mob1_accounts = accounts;
155 run;
NOTE: There were 1000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
Or, in summary (times in seconds):
way            real  cpu
-------------  ----  ----
SYMPUT/SYMGET  0.34  0.34
RETAIN         0.04  0.03
I want to convert my data from long to wide format using a data step. The problem is that, because of missing values, the values are not placed in the correct cells. I think that to solve the problem I have to include placeholders for the missing values.
The problem is that I don't know how to do that. Can someone please give me a tip on how to go about it?
data tic;
input id country$ month math;
datalines;
1 uk 1 10
1 uk 2 15
1 uk 3 24
2 us 2 15
2 us 4 12
3 fl 1 15
3 fl 2 16
3 fl 3 17
3 fl 4 15
;
run;
proc sort data=tic;
by id;
run;
data tot(drop=month math);
retain month1-month4 math1-math4;
array tat{4} month1-month4;
array kat{4} math1-math4;
set tic;
by id;
if first.id then do;
i=1;
do j=1 to 4;
tat{j}=.;
kat{j}=.;
end;
end;
tat(i)=month;
kat(i)=math;
if last.id then output;
i+1;
run;
Edit
I finally figured out what the problem is:
I changed these lines of code:
tat(i)=month;
kat(i)=math;
to:
tat(month)=month;
kat(month)=math;
and it fixed the problem.
Data transformations from tall and skinny to short and wide often mean that categorical data ends up as column names. This is a process of moving data to metadata, which can be a problem later on for dealing with BY or CLASS groups.
SAS has Proc TABULATE and Proc REPORT for creating pivoted output. Proc TRANSPOSE is also a good standard way of creating pivoted data.
I did notice that you are pivoting two columns at once. TRANSPOSE can't multi-pivot. The DATA step approach you showed is a typical way of doing a transpose transform when the indices lie within known ranges. In your case the array declaration must be sized so that 'direct addressing' via the index can handle the minimum and maximum month values that occur over all the data.
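A minimal sketch of the direct-addressing idea, reusing the dataset and array names from the question and assuming month always falls in 1 to 4 (the range check is an added safeguard, not part of the original code):

```sas
data tot(drop=month math);
  /* Array elements are reset to missing at the start of each step
     iteration, i.e. once per id, because nothing retains them */
  array tat{4} month1-month4;
  array kat{4} math1-math4;
  do until (last.id);
    set tic;
    by id;
    /* The month value itself selects the slot, so gaps stay missing */
    if 1 <= month <= dim(tat) then do;
      tat{month} = month;
      kat{month} = math;
    end;
  end;
  /* implicit OUTPUT here writes one row per id */
run;
```

If the months could run outside 1 to 4, the array bounds would have to be widened to cover the smallest and largest values in the data.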
I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using the retain function, but I can only figure out how to retain last non-missing value.
I think what you are looking for might be more like interpolation. While this is not the mean of the two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called proc expand. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /* ID statement; the second "id" is the column name */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, allowing it to use RETAIN to carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
What it does is uses two set statements, one that gets the main (and previous) amounts, and one that sets until the next amount is found. Here I use the sequence of id variables to guarantee it will be the right record, but you could write this differently if needed (keeping track of what loop you're on) if the id variables aren't sequential or in an order of any sort.
I use the first.amount check to make sure we don't try to execute the second set statement more than we should (which would terminate early).
You need to handle two cases separately if you want the first and last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume next_amount is missing for the last row, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose; I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
  set have;
  by amount notsorted; *so we can tell if we have consecutive missings;
  retain prev_amount; *next_amount is auto-retained;
  if not missing(amount) then prev_amount=amount;
  else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
  else if first.amount then do;
    do until ((next_id > id and not missing(next_amount)) or (eof));
      set have(rename=(id=next_id amount=next_amount)) end=eof;
    end;
    amount = mean(prev_amount,next_amount);
  end;
  else amount = mean(prev_amount,next_amount);
run;
Sorry for the confusing title.
Background
data looks like this
Area Date Ind LB UB
A 1mar 14 1 20
A 2mar 3 1 20
B 1mar 11 7 22
B 2mar 0 7 22
Area has several distinct values. For each area, LB and UB are fixed across multiple dates, while Ind varies. Date always starts from month start to certain day of the month.
Target
My target is to run a control chart for each area to see if Ind exceeds the range (LB,UB).
But if I just plot the raw data for each area, the x-axis by default does not end at the last day of the month. (In the previous example, the plot will run from 1-Mar to 2-Mar instead of to 31-Mar.) I do know that specifying the xmax option in xaxis extends the plot to 31-Mar. But this only extends the x-axis; LB and UB are still displayed from 1-Mar to 2-Mar only, leaving the right side of the graph empty.
Thus I use modify to add in some date records.
What I have done
data have;
modify have;
do i = 0 to intck('day',today(),intnx('month',today(),0,'E'));
Date = today()+i;
call missing(Ind);
output;
end;
stop;
run;
proc sgplot data=have missing;
series ... Ind ...;
series ... LB ...;
series ... UB ...;
run;
Question
But this only works for one area. I would need to modify each area first and then plot them one by one. How can I get the data below in a relatively efficient way?
Area Date Ind LB UB
A 1mar 14 1 20
A 2mar 3 1 20
A 3mar . 1 20
....
A 31mar. 1 20
B 1mar 11 7 22
B 2mar 0 7 22
B 3mar . 7 22
....
B 31mar. 7 22
Or are there other options in proc sgplot to solve this?
You can use proc timeseries with the by-group area to get it into the form that you need. The end= option will let you specify an ending date for your data. It looks like you're using the current month, so we'll take your intnx function and plop it into a set of macro functions that resolve to a date literal (most ETS procs require a date literal for some reason).
We'll use two var statements: one for ind where we fill in unobserved values with ., and another for LB & UB to set their unobserved values with the previous valid value.
Note that we are assuming you've already put date into a SAS date. Make sure you do this first before running the below code.
proc timeseries data=have
out=want;
by area;
id Date interval=day notsorted
accumulate=none
end="%sysfunc(intnx(month, %sysfunc(today() ), 0, E), date9.)"d;
var Ind / setmissing=missing;
var LB UB / setmissing=previous;
run;
Your final dataset will look exactly as you'd like.
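If SAS/ETS is not available, a plain DATA step can produce a similar result. This is only a sketch: it assumes have is sorted by Area and Date, that Date is a numeric SAS date, and that each area should be padded to the end of the month of its own last observed date (month_end is a hypothetical helper variable).

```sas
data want;
  set have;
  by area;
  output;                       /* keep every original row */
  if last.area then do;
    month_end = intnx('month', Date, 0, 'E');
    call missing(Ind);          /* padded rows carry no observation */
    /* LB and UB keep the values of the area's last row */
    do Date = Date + 1 to month_end;
      output;
    end;
  end;
  drop month_end;
run;
```

Because the padding happens once per BY group, every area is extended in a single pass, without a separate MODIFY step for each one.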
To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
  set transaction;
  file print;
  do until(done);
    set master key = week end=done;
    /*Prevent implicit retain from previous row if the key isn't found,
    or we've read past the last record for the current key*/
    if _IORC_ ne 0 then do;
      _ERROR_ = 0;
      call missing(value);
    end;
    else sum = sum(value, sum);
  end;
  put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error is that you used where instead of if. In SAS, the subsetting where in a data step is only aware of columns that already exist in the data set it is subsetting. Both options are kept because where is faster when it is usable.)
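A small sketch of the difference, using the master dataset from the question (week1_where, week1_if and target_week are illustrative names):

```sas
/* WHERE: rows are filtered as they are read, so the condition
   may only reference variables that exist in master */
data week1_where;
  set master;
  where week = 1;
run;

/* Subsetting IF: evaluated after the row is in the program data vector,
   so it can also use variables created in the step or read from another SET */
data week1_if;
  set master;
  target_week = 1;            /* hypothetical variable created in this step */
  if week = target_week;
run;
```

Both steps keep the three week=1 rows, but replacing the IF with where week = target_week; would fail, because target_week is not a column of master.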
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't recommend the solution below (I would prefer @Jeff's SQL solution or even a hash). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
data _null_;
  set transaction;
  by change_week;

  do until(last.week and _found);
    set master;
    by week;

    if week=change_week then do;
      sum = sum(value, sum);
      _found=1;
    end;
  end;

  *file print;
  put week= sum= ;
run;
week=1 sum=60
week=3 sum=75
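The hash alternative mentioned above could be sketched roughly as follows. It assumes master is sorted by week and that transaction stores its key as change_week, as in the question; wanted is a hypothetical name for the hash object.

```sas
data _null_;
  if _n_ = 1 then do;
    /* Load the transaction keys once, renaming the key to match master */
    declare hash wanted(dataset: 'transaction(rename=(change_week=week))');
    wanted.defineKey('week');
    wanted.defineDone();
  end;
  /* Read one whole week of master per step iteration;
     sum is not retained, so it restarts as missing for each week */
  do until (last.week);
    set master;
    by week;
    sum = sum(value, sum);
  end;
  /* Report only the weeks that appear in transaction */
  if wanted.check() = 0 then put week= sum=;
run;
```

This still reads all of master once, so unlike the indexed KEY= approach it does not skip unneeded weeks, but transaction does not need to be sorted and keys absent from master cause no problem.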