Say that I have the following database:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
7 1 120
7 2 110
7 3 100
7 4 90
I need to have the database with the continuous values for minutes like this:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
3 1 100
3 2 90
3 3 80
3 4 70
4 1 100
4 2 90
4 3 80
4 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
6 1 110
6 2 100
6 3 90
6 4 80
6 5 70
7 1 120
7 2 110
7 3 100
7 4 90
How can I do this in SAS? I just need to replicate the previous minute. The number of observations per minute varies...it can be 4 or 5 or more.
It is not that hard to imagine code that would do this, the problem is that it quickly starts to look messy.
If your dataset is not too large, one approach you could consider the following approach:
/* We find all gaps. the output dataset is a mapping: the data of which minute (reference_minute) do we need to create each minute of data*/
data MINUTE_MAPPING (keep=current_minute reference_minute);
set YOUR_DATA;
by min;
retain last_minute 2; *set to the first minute you have;
if _N_ NE 1 and first.min then do;
/* Find gaps, map them to the last minute of data we have*/
if last_minute+1 < min then do;
do current_minute=last_minute+1 to min-1;
reference_minute=last_minute;
output;
end;
end;
/* For the available data, we map the minute to itself*/
reference_minute=min;
current_minute=min;
output;
*update;
last_minute=min;
end;
run;
/* Now we apply our mapping to the data */
*you must use proc sql because it is a many-to-many join, data step merge would give a different outcome;
proc sql;
create table RESULT as
select YD.current_minute as min, YD.rank, YD.qty
MINUTE_MAPPING as MM
join YOUR_DATA as YD
on (MM.reference_minute=YD.min)
;
quit;
The more performant approach would involve trickery with arrays.
But i find this approach a bit more appealing (disclaimer: at first thought), it is quicker to grasp (disclaimer again: imho) for someone else afterwards.
For good measure, the array approach:
data RESULT (keep=min rank qty);
set YOUR_DATA;
by min;
retain last_minute; *assume that first record really is first minute;
array last_data{5} _TEMPORARY_;
if _N_ NE 1 and first.min and last_minute+1 < min then do; *gap found;
do current_min=last_minute+1 to min-1;
*store data of current record;
curr_min=min;
curr_rank=rank;
curr_qty=qty;
*produce records from array with last available data;
do iter=1 to 5;
min = current_minute;
rank = iter;
qty = last_data{iter};
if qty NE . then output; *to prevent output of 5th element where there are only 4;
end;
*put back values of actual current record before proceeding;
min=curr_min;
rank=curr_rank;
qty=curr_qty;
end;
*update;
last_minute=min;
end;
*insert data for use on later missing minutes;
last_data{rank}=qty;
if last.min and rank<5 then last_data{5}=.;
output; *output actual current data point;
run;
Hope it helps.
Note, currently no access to a SAS client where i am. So untested code, might contain a couple of typo's.
Unless you have an absurd number of observations, I think transposing would make this easy.
I don't have access to sas at the moment so bear with me (I can test it out tomorrow if you can't get it working).
proc transpose data=data out=data_wide prefix=obs_;
by minute;
id rank;
var qty;
run;
*sort backwards so you can use lag() to fill in the next minute;
proc sort data=data_wide;
by descending minute;
run;
data data_wide; set data_wide;
nextminute = lag(minute);
run;
proc sort data=data_wide;
by minute;
run;
*output until you get to the next minute;
data data_wide; set data_wide;
*ensure that the last observation is output;
if nextminute = . then output;
do until (minute ge nextminute);
output;
minute+1;
end;
run;
*then you probably want to reverse the transpose;
proc transpose data=data_wide(drop=nextminute)
out=data_narrow(rename=(col1=qty));
by minute;
var _numeric_;
run;
*clean up the observation number;
data data_narrow(drop=_NAME_); set data_narrow;
rank = substr(_NAME_,5)*1;
run;
Again, I can't test this now, but it should work.
Someone else may have a clever solution that makes it so you don't have to reverse-sort/lag/forward-sort. I feel like I have dealt with this before but the obvious solution for me right now is to have it sorted backwards at whatever prior sort you do (you can do the transpose with a descending sort no problem) to save you an extra sort.
Related
In this data, I need to subset by each variable by certain percentage.
For example,
Obs Group Score
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
I would need to subset 10 obs.
The sample must consist of all groups, and score of 1 takes higher priority.
Each group is given certain percent.
Let say 50% for A, 20% for B and 30% for C.
I tried using proc surveyselect but it failed. The number of alloc is not same as the strata.
proc surveyselect data=example out=test sampsize=10;
strata group score/alloc=(0.5 0.2 0.3);
run;
I don't know proc surveyselect too much, so I give the data step version.
data have;
input Obs Group$ Score;
cards;
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
;
run;
proc sort;
by Group Score;
run;
data want;
array _Dist_[3]$ _temporary_('A','B','C');
array _Upper_[3] _temporary_(5,2,3);
array _Count_[3] _temporary_;
do i = 1 to rec;
set have nobs=rec point=i;
do j = 1 to dim(_Dist_);
_Count_[j] + (Group=_Dist_[j]);
if _Count_[j] <= _Upper_[j] and Group = _Dist_[j] then output;
end;
end;
stop;
drop j;
run;
I have 70 databases of different sizes (same number of columns, different numbers of lines).
I need to get the 25% higher values and the 25% lower values considering a given column VAR1.
I have:
id VAR1
1 10
2 -5
3 -12
4 7
5 12
6 7
7 -9
8 -24
9 0
10 6
11 -18
12 22
Sorting by VAR1, I need to select the rows (all columns) containing the 3 smallest and the 3 largest (25% from each extreme), i.e.,
id VAR1
8 -24
11 -18
3 -12
7 -9
2 -5
9 0
10 6
4 7
6 7
1 10
5 12
12 22
I need to keep in the database the rows (all columns) that contain the VAR1 equal to -24, -18, -12, 10, 12 and 22.
id VAR1
8 -24
11 -18
3 -12
1 10
5 12
12 22
What I’ve been thinking:
Order column VAR1 in ascending order;
Create a numbered column from 1 to N (n=_N_) - in this case, N=12;
I do a=N*0.25 (to have the value that represents 25%);
I do b=N-a (to have the value that represents the "last" 25%).
So, I can use keep:
if N<a.... I will have the first 25% (the smallest).
if N>b.... I will have the last 25% (the largest).
I can calculate a and b.
But I’m not getting the maximum value of N in this case 12.
I will repeat this for the 70 database, I would not like to have to enter this maximum value every time (it varies from one database to another).
I need help to "fix" the maximum value (N) without having to type it (even if it is repeated in all the lines of another "auxiliary column").
Or if there’s some better way to get those 25% from each end.
My code:
proc sort data=have; by VAR1; run;
data want; set have;
seq=_N_;
N=max(seq); *N=max. value of lines. (I stopped here and don’t know if below is right);
a=N*0.25;
b=N-b;
if N<a;
if N>b;
run;
Thank you very much!
Proc RANK computes percentiles that you can use to select the desired rows.
Example:
data have1 have2 have3 have4 have5;
do id = 1 to 100;
X = ceil(rand('normal', 0, 10));
if id < 60 then output have1;
if id < 70 then output have2;
if id < 80 then output have3;
if id < 90 then output have4;
if id < 100 then output have5;
end;
run;
proc rank data=have1 percent out=want1(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
proc rank data=have2 percent out=want2(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
proc rank data=have3 percent out=want3(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
I try to construct Table 2 by writing below SAS code but what I get is the Table 1. I could not figure out what I missed. Help very appreciated Thank you.
&counter = 4
data new;set set1;
total = 0;
a = 1;
do i = 1 to &counter;
call symputX('a',a);
total = total + Tem_&a.;
a = symget('a')+1;
call symputX('a',a);
end;
run;
Table 1
ID Amt Tem_1 Tem_2 Tem_3 Tem_4 total
4 500 1 4 5 900 3600
5 200 50 100 200 0 0
9 50 40 0 0 0 0
10 500 70 100 250 0 0
Table 2
ID Amt Tem_1 Tem_2 Tem_3 Tem_4 total
4 500 1 4 5 900 910
5 200 50 100 200 0 350
9 50 40 0 0 0 40
10 500 70 100 250 0 420
You cannot use SYMPUT and SYMGET that way, unfortunately. While you can use them to store/retrieve macro variable values, you cannot change the code sent to the compiler after execution.
Basically, SAS has to figure out the machine code for what it's supposed to do on every iteration of the data step loop before it looks at any data (this is called compiling). So the problem is, you can't define tem_&a. and expect to be allowed to change what _&a. is during execution, because it would change what that machine code needs to do, and SAS couldn't prepare for that sufficiently.
So, what you wrote the &a. would be resolved when the program compiled, and whatever value &a. had before your data step woudl be what tem_&a. would turn into. Presumably the first time you ran this it errored (&a. does not resolve and then an error about & being illegal in variable names), and then eventually the call symput did its job and &a got a 4 in it at the end of the loop, and forever more your tem_&a. resolved to tem_4.
The solution? Don't use macros for this. Instead, use arrays.
data new;
set set1;
total = 0;
array tem[&counter.] tem_1-tem_&counter.;
a = 1;
do i = 1 to &counter; *or do i = 1 to dim(tem);
total = total + Tem[i];
end;
run;
Or, of course, just directly sum them.
data new;
set set1;
total = sum(of tem_1-tem_4);
run;
If you REALLY like macro variables, you could of course do this in a macro do loop, though this is not recommended for this purpose as it's really better to stick with data step techniques. But this should work, anyway, if you run this inside a macro (this won't be valid in open code).
data new;
set set1;
total = 0;
%do i = 1 %to &counter;
total = total + Tem_&i.;
%end;
run;
Is it possible to merge below two tables using hash object in SAS 9.1 example below ? The main problemseems to be creation of Value variable w Result dataset. Problem is that each payment could pay for more than one charge, and sometimes more than one payment is need to pay for one charge and this tho cases could appear simultaneously. Does it problem has some general name ?
http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf
data TABLE1;
input ID_client ID_commodity Charge;
datalines;
1 111111111 100
1 222222222 200
2 333333333 300
2 444444444 400
2 555555555 500
;;;;
run;
data TABLE2;
input ID_client_hash ID_ofpayment paymentValue;
datalines;
1 11 50
1 12 50
1 13 100
1 14 50
1 15 50
2 21 500
2 22 200
2 23 100
2 24 200
2 25 200
;;;;
run;
data OUT;
input ID_client ID_commodity ID_ofpayment value;
datalines;
1 111111111 11 50
1 111111111 12 50
1 222222222 13 100
1 222222222 14 50
1 222222222 15 50
2 333333333 21 300
2 444444444 21 200
2 444444444 22 200
2 555555555 23 100
2 555555555 24 200
2 555555555 25 200
This might work for you - I have 9.2 and 9.2 has some significant hash improvements, but I think I behaved myself and only used what was there in 9.1. You might try crossposting this to SAS-L [SAS listserv] as Paul Dorfman (ie, The Hash Guru) reads that still I believe.
I assumed you want the 'leftovers' posted out. You may need to work on that part, if it's not working the way you want. This isn't terribly well tested, it works for your example dataset. I call missing the commodity for 24 and 25 since they're not used for that.
I'm pretty sure there's a more clean way to do the iteration than what I do, but since 9.2+ is what I use and we have multidata available, i've always used that instead of hash iterators so I don't know the cleaner methods.
data have;
input ID_client ID_commodity Charge;
datalines;
1 111111111 100
1 222222222 200
2 333333333 300
2 444444444 400
2 555555555 50
;;;;
run;
data for_hash;
input ID_client_hash ID_ofpayment paymentValue;
datalines;
1 11 50
1 12 50
1 13 100
1 14 50
1 15 50
2 21 500
2 22 200
2 23 100
2 24 200
2 25 200
;;;;
run;
data want;
*Create hash and hash iterator - must use iterator since 9.1 does not allow multidata option;
if _n_ = 1 then do;
format id_client_hash paymentValue id_ofpayment BEST12.;
declare hash h(dataset:'for_hash' , ordered: 'a');
h.defineKey('ID_client_hash','id_ofpayment'); *note I put id_client_hash, renaming the id - want to be able to compare them;
h.defineData('id_client_hash','id_ofpayment','paymentValue');
call missing(id_ofpayment,paymentValue, id_client_hash);
h.defineDone();
declare hiter hi('h');
end;
do _t = 1 by 1 until (last.id_client);
set have;
by id_client;
*Iterate through the hash and find the first record with the same ID_client;
do rc = hi.first() by 0 while (rc eq 0 and ID_client ne ID_client_hash);
rc = hi.next();
end;
*For the current charge record, iterate through the payment (hash) until all paid up.;
do while (charge gt 0 and rc eq 0 and ID_client=ID_client_hash);
if charge ge paymentValue then do; *If charge >= paymentvalue, use up the payment value;
value = paymentValue; *so whole paymentValue is value;
charge = charge - paymentValue; *charge is decremented by paymentValue;
output; *output row;
_id=ID_client_hash;
_pay=id_ofpayment;
rc = hi.next();
h.remove(key:_id,key:_pay); *remove payment row from hash now that it has been used up;
end;
else do; *this is if (remaining) charge is less than payment - we will not use all of the payment;
value = charge; *value is the remainder of the charge, ie, how much of payment was actually used;
paymentValue = paymentValue - charge; *paymentValue is the remainder of paymentValue;
charge= 0; *charge is zero now;
output; *output a row;
h.replace(); *replace paymentValue in the hash with the new value of paymentValue, minus charge;
end;
end; *end of iteration through hash - at this point, either charge = 0 or we have run out of payments with that ID;
if charge gt 0 then do;
value=-1*charge;
call missing(id_ofpayment);
output; *output a row for the charge, which is not paid;
end;
if last.id_client then do; *this is cleanup, checking to see if we have any leftover payments;
do while (rc=0); *iterate through the remaining hash;
do rc = hi.first() by 0 while (rc eq 0 and ID_client ne ID_client_hash);
rc = hi.next();
end;
if rc=0 then do;
call missing(id_commodity); *to make it clear this is a leftover payment;
value=paymentValue; *update the value;
output; *output the payment;
_id=ID_client_hash;
_pay=id_ofpayment;
rc = hi.next();
if rc= 0 then h.remove(key:_id,key:_pay); *remove the payment just output;
end;
end;
end;
end;
keep id_client id_ofpayment id_commodity value;
run;
Among other things, this isn't terribly fast - I do a lot of iterating that might be wasteful. It will be relatively faster if you don't have any payment ID_client records that aren't represented in the charge records- any that you do are getting skipped over, so that could end up super slow.
I'm not confident hash is the superior solution, at least pre-9.2; keyed UPDATE might be superior. UPDATE is pretty much made for transactional database structures, which this seems close to.
How can we do iteration in a sas dataset.
For example I have chosen the first. of a variable.
And want to find the occurence of a particular condition and set a value when it satisfy
SAS data step has a built-in loop over observations. You don't have to do any thing, unless you want to, for some reason. For instance, the following generates a random number for each observation:
data one;
set sashelp.class;
rannum = ranuni(0);
run;
If you want to loop over variables, then there are arrays. For example, the following initializes variables, var1 to var10, with random numbers:
data one;
array vars[1:10] var1-var10;
do i = 1 to 10;
vars[i] = ranuni(0);
end;
run;
The first. and last. flags are automatically generated when you set a (sorted) data with a by statement. An example:
proc sort data=sashelp.class out=class;
by age;
run;
data one;
set class;
by age;
first = first.age;
last = last.age;
run;
/* check */
proc print data=one;
run;
/* on lst
Obs Name Age first last
1 Joyce 11 1 0
2 Thomas 11 0 1
3 James 12 1 0
4 Jane 12 0 0
5 John 12 0 0
6 Louise 12 0 0
7 Robert 12 0 1
8 Alice 13 1 0
...
18 William 15 0 1
19 Philip 16 1 1
*/