Here's a very similar question
My question is a bit different from the one in the above link.
Background
I have a data set that contains hourly data, so each object has 24 records per day. I want to create K new columns that hold the next 1, 2, ..., K hourly records for each object. Where a next record does not exist, the column should be set to missing.
K is dynamic and is defined by users.
The original order must be preserved, whether that is guaranteed by the data steps themselves or by sorting at the end.
I'm looking for an efficient way to achieve this.
Example
Original data:
Object Hour Value
A 1 2.3
A 2 2.3
A 3 4.0
A 4 1.3
Given K = 2, desired output is
Object Hour Value Value1 Value2
A 1 2.3 2.3 4.0
A 2 2.3 4.0 1.3
A 3 4.0 1.3 .
A 4 1.3 . .
Possible solutions
sort in reverse order -> obtain previous k records -> sort them back.
When the number of observations is large, this is unlikely to be an ideal approach.
proc expand. I'm not familiar with it because it isn't licensed on my PC.
Using point in data step.
retain statement inside data step. I'm not sure how this works.
Assuming K is provided as a macro variable, this is pretty easily done with a side-by-side merge-ahead. It is certainly faster than a transpose for K much smaller than the total record count, and probably faster than looping POINTs.
Basically you merge the original dataset to itself, and use FIRSTOBS to push the starting point down one for each successive merge iteration. This needs a bit of extra work if you have BY groups that need protecting, but that's usually not too hard to manage.
Here's an example using SASHELP.CLASS:
%let K=5;
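%macro makemergesets(k=, datain=, varin=, keepin=);
  %* Emit one shifted copy of the input per step: copy _i starts at row _i, so it holds the row (_i - 1) ahead ;
  %do _i = 2 %to &k;
    &datain (firstobs=&_i rename=(&varin.=&varin._&_i.) keep=&keepin. &varin.)
  %end;
%mend makemergesets;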
data class_all;
merge sashelp.class
%makemergesets(k=&k,datain=sashelp.class, varin=age,keepin=)
;
run;
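When BY groups do need protecting, one way to sketch it is to carry the grouping variable along in each shifted copy and blank out any look-ahead value that belongs to a different group. The step below assumes the question's layout (Object, Hour, Value) with K=2. The second object B, the dataset names have and want, and the helper variables _obj1 and _obj2 are made up for illustration.

```sas
/* Example data from the question, plus a hypothetical second object B */
data have;
  input Object $ Hour Value;
  datalines;
A 1 2.3
A 2 2.3
A 3 4.0
A 4 1.3
B 1 5.0
B 2 6.0
;
run;

data want;
  /* One-to-one merge (no BY): each shifted copy starts one row later */
  merge have
        have(firstobs=2 keep=Object Value rename=(Object=_obj1 Value=Value1))
        have(firstobs=3 keep=Object Value rename=(Object=_obj2 Value=Value2));
  /* A look-ahead row from a different object is not a valid next record */
  if _obj1 ne Object then Value1 = .;
  if _obj2 ne Object then Value2 = .;
  drop _obj1 _obj2;
run;
```

With this data, the last row of A gets missing Value1 and Value2 instead of picking up values from B, and the original row order is kept because no sort is involved.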
You could transpose the hours and then freely access the hours ahead within each object. Just to set the value of K and generate some dummy data:
* Assign K ;
%let K=3 ;
%let Kn=value&k;
* Generate test objects each containing 24 hourly records ;
data time ;
do object=1 to 10 ;
do hour=1 to 24 ;
value=round(ranuni(1)*10,0.1) ;
output ;
end ;
end ;
run ;
EDIT: I updated the step below as I realised the transpose isn't needed. Doing it all in one step gives a ~20% improvement in CPU time.
Use an array of the 24 hour values and loop through do i=1 to &k for each hour:
* Populate K variables ;
data output(keep=object hour value value1-&kn ) ;
set time ;
by object ;
retain k1-k24 . ;
array k(2,24) k1-k24 value1-value24 ;
k(1,hour)=value ;
if last.object then do hour=1 to 24 ;
value=k(1,hour) ;
do i=1 to &k ;
if hour+i <=24 then k(2,i)=k(1,hour+i) ;
else k(2,i)=.;
end ;
output ;
end ;
run ;
Related
I want to do a sum of 250 previous rows for each row, starting from the row 250th.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I tried to use the lag function, but it is not workable with this many lags.
I also want to calculate sum of 250 next rows after each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
Creating a backwards moving sum of 250 observations, then setting the initial 250 to missing.
Reversing VWRET, creating a moving sum of 250 observations, setting the initial 250 to missing, then reversing it again. This effectively creates a forward moving sum.
The key is how to read observations from previous and following rows. As for your sum(n1, n2,...,nx) function, you can replace it with iterative summation.
This example uses multiple SET statements to sum a variable over the 25 previous and 25 following rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=air=pre_air) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=air=post_air) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only rows 26 through nobs-25 will be calculated, where nobs is the number of observations in the input dataset sashelp.air.
Multiple SET statements can be slow on a big dataset. If you want something more efficient, you can use a temporary array and a DoW loop instead of the multiple-SET technique:
data test;
array _val_[1024]_temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is that you have to give the array a dimension, and it must be greater than or equal to nobs.
These techniques come from a concept called "Table Look-Up". For the SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
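On the array-dimension weakness mentioned above, one possible workaround is to read the row count into a macro variable first and use it to size the array. This is only a sketch: the macro variable name nobs is an assumption, and it relies on sashelp.air being a regular dataset so that NOBS= is available.

```sas
/* Store the row count of SASHELP.AIR in a macro variable (the name NOBS is an assumption) */
data _null_;
  if 0 then set sashelp.air nobs=nobs;  /* SET never executes; NOBS= is filled at compile time */
  call symputx('nobs', nobs);
  stop;
run;

data test;
  /* Array sized exactly to the input instead of a hard-coded 1024 */
  array _val_[&nobs] _temporary_;
  if _n_=1 then do i=1 by 1 until(eof);
    set sashelp.air end=eof;
    _val_[i]=air;
  end;
  set sashelp.air nobs=nobs;
  if 25<_n_<nobs-25+1 then do;
    do i=_n_-25 to _n_-1;
      sum_pre=sum(sum_pre,_val_[i]);
    end;
    do j=_n_+1 to _n_+25;
      sum_post=sum(sum_post,_val_[j]);
    end;
  end;
  drop i j;
run;
```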
You don't want to use normal arithmetic with missing values because the result is then always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum but add the wrinkle of removing the last one in by subtraction. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable sort the dataset in descending order and use the same logic to make another variable.
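A rough sketch of that descending-order pass, reusing the same running-sum logic. The names seq, rev, step2, final, fwd_sum and want_fwd are assumptions added for illustration; seq only exists so the original order can be restored afterwards.

```sas
%let n=5 ;

/* Remember the original row order */
data step0;
  set sashelp.class(keep=name age);
  seq = _n_;
run;

/* Reverse the data so that "previous rows" become "next rows" */
proc sort data=step0 out=rev;
  by descending seq;
run;

/* Same running-sum logic as above, applied to the reversed data */
data step2;
  set rev;
  retain fwd_sum ;
  fwd_sum=sum(fwd_sum,age,-(sum(0,lag&n.(age))));
  if _n_ >= &n then want_fwd=fwd_sum;
run;

/* Put the rows back in their original order */
proc sort data=step2 out=final(drop=fwd_sum);
  by seq;
run;
```

As in the backward version, the window includes the current row, so want_fwd on a given row is the sum of that row and the following &n-1 rows, and it is missing for the last &n-1 rows.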
I would like to create one macro called 'currency_rate' which calls the correct value depending on the conditions stipulated
(either 'new_rate' or static value of 1.5):
%macro MONEY;
%Do i=1 %to 5;
data get_currency_&i (keep=month code new_rate currency_rate);
set Table1;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
run;
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', ???);
run;
%End;
%mend MONEY;
%MONEY
I am happy with the do loop and first data step. It is the call symput I am stuck on. Is call symput the correct function to use, to assign two possible values to one macro?
A snippet example of the way I will be using 'currency_rate' in a proc sql:
t1.income/&currency_rate.
I am a beginner level SAS user, any guidance would be great!
Thanks
Let's simulate your case. Suppose we have 3 datasets, as shown below -
data get_currency_1;
input month code $ new_rate currency_rate;
cards;
1 USD 2 2
2 CHF 2 1.5
3 GBP 1 1.5
;
data get_currency_2;
input month code $ new_rate currency_rate;
cards;
1 USD 3 1.5
2 USD 4 4
3 JPY 0.5 1.5
;
data get_currency_3;
input month code $ new_rate currency_rate;
cards;
1 USD 1 1.5
2 USD 3 1.5
3 USD 2.5 2.5
;
Now, let's run your code where we assign a value to currency_rate.
Let i=1, so the dataset get_currency_1 will be accessed. As the step runs, every row is read in turn and the value of currency_rate is assigned to the macro variable currency_rate, overwriting the previous assignment each time. When the step ends, the macro variable therefore holds the value from the last row.
%let i=1; /*Let's assign 1 to i*/
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', currency_rate);
run;
%put Currency rate is: &currency_rate;
Currency rate is: 1.5
Let i=3:
%let i=3; /*Let's assign 3 to i*/
data _null_;
set get_currency_&i;
if month = &i and code = 'USD' then currency_rate=new_rate;
else currency_rate=1.5;
call symput ('currency_rate', currency_rate);
run;
%put Currency rate is: &currency_rate;
Currency rate is: 2.5
You cannot have multiple values on one macro variable.
You say you are a beginner, so the best course of action is to avoid macro programming at this point. You would be better served learning about where, merge (or join) and by statements.
You state you will need to use a currency_rate in a statement such as
t1.income / &currency_rate.
The t1. to me suggests t1 is an alias in a SQL join and thus the far more likely scenario is that you need to left join table t1 that contains incomes with table1 (call it monthly_datum) that contains the monthly currency rates.
select
t1.income / coalesce(monthly_datum.currency_rates,1.5)
, …
from
income_data as t1
left join
monthly_datum
on t1.month = monthly_datum.month
The rate of 1.5 would be used when the income is associated with a month that is not present in monthly_datum.
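To make the fallback concrete, here is a small self-contained sketch. The rows in income_data and monthly_datum, the column name currency_rate, the alias m and the output name converted are all invented for illustration and are not taken from your tables.

```sas
/* Hypothetical incomes: month 3 has no matching rate */
data income_data;
  input month income;
  datalines;
1 300
2 450
3 150
;
run;

/* Hypothetical monthly rates for months 1 and 2 only */
data monthly_datum;
  input month currency_rate;
  datalines;
1 2
2 3
;
run;

proc sql;
  create table converted as
  select t1.month,
         t1.income,
         /* COALESCE returns the first non-missing argument, so 1.5 is the default */
         t1.income / coalesce(m.currency_rate, 1.5) as income_converted
  from income_data as t1
       left join monthly_datum as m
       on t1.month = m.month;
quit;
```

Months 1 and 2 are divided by their own rates, while month 3 has no match in monthly_datum and is divided by 1.5, giving 100.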
A macro variable can only hold a single value.
Since you're only ever assigning a single value, you can easily use CALL SYMPUTX().
call symputx('currency_rate', currency_rate);
But if your data has more than one row, then the value will be the last value set in the data set.
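One way to avoid depending on row order is to set a default first and then assign the macro variable only from the row that matches. This is a sketch that assumes the simulated get_currency_3 dataset shown earlier; the default of 1.5 and the STOP after the first match are my assumptions about what is wanted.

```sas
%let i=3;
%let currency_rate=1.5;   /* default, kept when no row matches */

data _null_;
  set get_currency_&i;
  where month = &i and code = 'USD';   /* only matching rows are read */
  call symputx('currency_rate', new_rate);
  stop;                                /* use the first match only */
run;

%put Currency rate is: &currency_rate;
```

With get_currency_3 this picks up new_rate from the month 3 USD row (2.5). If no row satisfies the WHERE condition, the step reads nothing and the macro variable keeps the default.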
Can someone explain this code to me in depth?? I have a list of comments in the code where I am confused. Is there anyway I can attach a csv of the data? Thanks in advance.
data have;
infile "&sasforum.\datasets\Returns.csv" firstobs=2 dsd truncover;
input DATE :mmddyy10. A B B_changed;
format date yymmdd10.;
run;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
end;
br = B;
do i = 1 to nb;
set have; *** I do not get how do i = 1 to nb can work with set have. There is no variable nb on set have. The variable nb is read into the dataset spread;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
end;
drop nb i br;
run;
***** If I comment out "drop nb i br" I get to see that nb takes a value of 2 for the null values of B. I do not get how this is done or possible, because if I run the code up to the line "br = B" and put an output statement in the first do loop, I clearly see that nb takes a value of one for B null values. Honestly, it is like the first do loop reads in future observations for B as BR. Can you please explain this to me? The second dataset "bunch" seems to follow the same type of principles as the first, so I imagine that if I get a grasp on how the dataset spread is created, then I will understand how bunch is created.;
This is an advanced DATA step programming technique, commonly referred to as a DoW loop. If you search lexjansen.com for DoW, you will find helpful papers like http://support.sas.com/resources/papers/proceedings09/038-2009.pdf. The DoW loop codes an explicit loop around a SET statement. This is actually a "Double-DoW loop", because you have two explicit loops.
I made some sample data, and added some PUT statements to your code:
data have ;
input B ;
cards ;
.
.
1
2
.
.
.
3
;
data spread;
do nb = 1 by 1 until(not missing(B));
set have;
put _n_= "top do-loop " (nb B)(=) ;
end;
br = B;
do i = 1 to nb;
set have;
if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
else B_spread = B;
output;
put _n_= "bottom do-loop " (nb B br B_spread)(=) ;
end;
drop nb i br;
run;
With that sample data, on the first iteration of the DATA step (_N_=1), the top DO loop will iterate three times, reading the first three records of HAVE. At that point, (not missing(B)) will be true, and the loop will not iterate again. The variable NB will have a value of 3. The bottom loop will then iterate 3 times, because NB has a value of 3. It will also read the first three records of HAVE, compute B_spread, and output each record.
On the second iteration of the DATA step, the top DO loop will iterate only once. It will read the 4th record, with B=2. The bottom loop will iterate once, reading the 4th record, computing B_spread, and output.
On the third iteration of the DATA step, the top DO loop will iterate four times, reading the 5th through 8th records. The bottom loop will also iterate four times, reading the 5th through 8th records, computing B_spread, and output.
On the fourth iteration of the DATA step, the step will complete, because the SET statement in the top loop will read the End Of File mark.
The core concept of a Double-DoW loop is that typically you are reading the data in groups. Often groups are identified by an ID. Here they are defined by sequential records read until not missing(B). The top DO-loop reads the first group of records, and computes some value (in this case, it computes NB, the number of records in the group). Then the bottom DO-loop reads the first group of records, and computes some new value, using the value computed in top DO-loop. In this case, the bottom DO-loop computes B_spread, using NB.
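For comparison, here is a sketch of the more common ID-based form, where the group is a BY group rather than a run of records ending in a non-missing B. It uses SASHELP.CLASS grouped by sex; the names class, centered, mean_height, height_dev and the underscore-prefixed helpers are assumptions made for this illustration.

```sas
proc sort data=sashelp.class out=class;
  by sex;
run;

data centered;
  /* Top loop: read one BY group, counting rows and accumulating a total */
  do _n = 1 by 1 until (last.sex);
    set class;
    by sex;
    _total = sum(_total, height);
  end;
  mean_height = _total / _n;

  /* Bottom loop: re-read the same _n rows and use the group-level value */
  do _i = 1 to _n;
    set class;
    height_dev = height - mean_height;
    output;
  end;
  drop _n _i _total;
run;
```

Each iteration of the DATA step handles one whole group: the top loop computes the group mean, and the bottom loop writes every row of that group with its deviation from the mean. Because _total is not retained, it starts as missing again for the next group.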
I'm trying to figure out how to do the following in SAS. I have data taken by an NO2 sensor every minute. A GPS is recording a location at give or take every second. I need to assign the value of each data recording for every minute to the previous seconds of that past minute. The NO2 data recorded is an average of the previous minute.
Here's a sample of my data: Sample Data
I am looking to bring the data from the last line (NO2, Humidity, Temperature) "up" to seconds of the previous minute which have a GPS reading. The column is in DateTime format.
Would love any pointers on how to do this... Thanks in advance!
Your question is a bit unclear unfortunately. Assuming that you want to retrospectively assign NO2, Humidity and Temperature for all missing values within the last minute (60 seconds) you can do the following:
Dummy data:
data input ;
format datetime datetime20. ;
datetime='28SEP2015:07:21:26'dt ;
do i=1 to 120 ;
datetime+1 ;
if i=80 then do ; no2=0.007 ; humidite=55.9 ; temperature=22.4 ; end ;
else if i=120 then do ; no2=0.020 ; humidite=65.0 ; temperature=23.5 ; end ;
else call missing(no2, humidite,temperature) ;
output ;
end ;
run ;
Solution:
Keep only the key records that contain values:
data key(rename=(datetime=keytime)) ;
set input(where=(nmiss(no2,humidite,temperature) ne 3));
run ;
Run the key records over the original data as a hashtable:
data output(drop=rd i keytime) ;
  * Declare KEYTIME in the PDV without reading any rows ;
  if 0 then set key(keep=keytime) ;
  * Load key table into memory, ordered by time ;
  if _n_=1 then do ;
    declare hash pt(dataset:"key",ordered:"yes");
    declare hiter iter('pt');
    pt.defineKey('keytime');
    pt.defineData('keytime','no2','humidite','temperature');
    pt.defineDone();
  end ;
  * Read in original data ;
  set input ;
  * Walk the key records until one at or after the current time is found ;
  rd=iter.first() ;
  do while (rd=0 and datetime gt keytime) ;
    rd=iter.next() ;
  end ;
  * Keep the key values only when the reading is at most 60 seconds ahead ;
  if not (rd=0 and keytime-60 <= datetime <= keytime) then
    call missing(no2,humidite,temperature) ;
  output ;
run;
It sounds like you have two data sources. One generates an observation every minute. The other generates an observation as often as every second. You want to join the two datasets together such that all of the frequently-sampled values within a given minute get the same occasionally-sampled value that corresponds to that minute.
Assume that the NO2 samples really are taken every minute and have a datetime value named NO2_TD, and that the GPS datetime stamp is named GPS_TD [having a variable named the same as a function, informat, and format is generally a bad idea]. One of the best ways to join tables is through SQL. Doing a FULL OUTER JOIN guarantees that if we have missing data on either side, we still get the observations from the other data source.
PROC SQL ;
CREATE TABLE joined_data AS
SELECT a.*, b.NO2_MSR, b.NO2_TD
FROM gps_data a
FULL OUTER JOIN
no2_data b
ON INTNX('DTMINUTE',a.GPS_TD,0,'B') EQ INTNX('DTMINUTE',b.NO2_TD,0,'B')
;
QUIT ;
Here we use INTNX to essentially ignore the seconds. We shift the datetime values to a minute boundary (DTMINUTE), by 0 minutes, to the Beginning of the minute, essentially truncating the seconds for the purposes of the join. INTNX is very commonly used to get the first of the month, last of the month, etc.
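A quick way to see the truncation on a single value is the sketch below. The timestamp and the variable names ts and floor_min are made up, and I have used the plain 'MINUTE' interval here, which I know accepts datetime values, rather than the 'DTMINUTE' spelling from the join.

```sas
data _null_;
  ts = '28SEP2015:07:21:46'dt;
  /* Shift by 0 minutes and align to the Beginning of the interval */
  floor_min = intnx('MINUTE', ts, 0, 'B');
  put ts= datetime20. floor_min= datetime20.;
run;
```

The log should show floor_min at 07:21:00 on the same day, so every GPS reading taken during that minute produces the same join key.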
If all you need to do is back-fill the "time sorted" environmental data without regard to the length of time between GPS and environmental data sampling, then the following code pattern (based on Bendy's dummy data creation code) should be sufficient.
/* assuming input data is sorted by datetime value */
/* based on dummy data created with Bendy's code */
data fill_missing;
* read in the data but only keep the gps data - need to add lat and long to keep statement;
set input(keep=i datetime);
if _n_=1 or datetime > datetime2 then do;
drop datetime2; * drop the datetime value associated with the non-gps data ;
* read only the rows that have the non-missing data using this set statement ;
* and only keep the datetime (renamed to datetime2) and the non-gps data ;
set input(keep=datetime no2 humidite temperature
rename=(datetime=datetime2)
where=(no2 ne .));
end;
run;
Also note that this code pattern will stop outputting data when the datetime value of the input data exceeds the max datetime value associated with non-missing environmental data (non-gps). It sounds like this likely isn't a problem for your usage.
Suppose the dataset has 3 columns
Obs Theo Cal
1 20 20
2 21 23
3 21 .
4 22 .
5 21 .
6 23 .
Theo is the theoretical value while Cal is the estimated value.
I need to calculate the missing Cal.
For each Obs, its Cal is a linear combination of previous two Cal values.
Cal(3) = Cal(2) * &coef1 + Cal(1) * &coef2.
Cal(4) = Cal(3) * &coef1 + Cal(2) * &coef2.
But Cal = lag1(Cal) * &coef1 + lag2(Cal) * &coef2 didn't work as I expected.
The problem with using lag is when you use lag1(Cal) you're not getting the last value of Cal that was written to the output dataset, you're getting the last value that was passed to the lag1 function.
It would probably be easier to use a retain as follows:
data want(drop=Cal_l:);
set have;
retain Cal_l1 Cal_l2;
if missing(Cal) then Cal = Cal_l1 * &coef1 + Cal_l2 * &coef2;
Cal_l2 = Cal_l1;
Cal_l1 = Cal;
run;
I would guess you wrote a datastep like so.
data want;
set have;
if missing(cal) then
cal = lag1(cal)*&coef1 + lag2(cal)*&coef2;
run;
LAG isn't grabbing a previous value; rather, it creates a queue that is N long and gives you the value at the end of it. If you have it behind an IF statement, then you will never put the useful values of CAL into that queue - you'll only be tossing missings into it. See it like so:
data have;
do x=1 to 10;
output;
end;
run;
data want;
set have;
real_lagx = lag(x);
if mod(x,2)=0 then do;
not_lagx = lag(x);
put real_lagx= not_lagx=;
end;
run;
The Real lags are the immediate last value, while the NOT lags are the last even value, because they're inside the IF.
You have two major options here. Use RETAIN to keep track of the last two observations, or use LAG like I did above before the IF statement and then use the lagged values inside the IF statement. There's nothing inherently better or worse with either method; LAG works for what it does as long as you understand it well. RETAIN is often considered 'safer' because it's harder to screw up; it's also easier to watch what you're doing.
data want;
set have;
retain cal1 cal2;
if missing(cal) then cal=cal1*&coef1+cal2*&coef2;
output;
cal2=cal1;
cal1=cal;
run;
or
data want;
set have;
cal1=lag1(cal);
cal2=lag2(cal);
if missing(cal) then cal=cal1*&coef1+cal2*&coef2;
run;
The latter method will only work if cal is infrequently missing - specifically, if it's never missing more than once from any three observations. In the initial example, the first cal (row 3) will be populated, but from there on out it will always be missing. This may or may not be desired; if it's not, use retain.
There might be a way to accomplish it in a DATA step, but when I want SAS to process iteratively, I use PROC IML and a do loop. I named your table SO and successfully ran the following:
PROC IML;
use SO; /* create a matrix from your table to be used in proc iml */
read all var _all_ into table;
close SO;
Cal=table[,3];
do i=3 to nrow(cal); /* process iteratively the calculations */
if cal[i]=. then cal[i]=&coef1.*cal[i-1]+&coef2.*cal[i-2];
end;
table[,3]=cal;
Varnames={"Obs" "Theo" "Cal"};
create SO_ok from table [colname=varnames]; /* outputs a new table */
append from table;
close SO_ok;
QUIT;
I'm not saying you couldn't use lag() and a DATA step to achieve what you want to do. But I find that PROC IML is useful and more intuitive when it comes to iterative process.