mean of 10 variables with different starting point (SAS) - sas

I have 18 numerical variables pm25_total2000 to pm25_total2018
Each person have a starting year between 2013 and 2018, we can call that variable "reqyear".
Now I want to calculate mean for each persons 10 years before the starting year.
For example if a person have starting year 2015 I want mean(of pm25_total2006-pm25_total2015)
Or if a person have starting year 2013 I want mean(of pm25_total2004-pm25_total2013)
How to do this?
data _null_;
set scapkon;
reqyear=substr(iCDate,1,4)*1;
call symput('reqy',reqyear);
run;
data scatm;
set scapkon;
/* Medelvärde av 10 år innan rekryteringsår */
pm25means=mean(of pm25_total%eval(&reqy.-9)-pm25_total%eval(&reqy.));
run;
%eval(&reqy.-9) will be constant value (the same value for all as for the first person) , in my case 2007
That doesn't work.

You can compute the mean with a traditional loop.
data want;
set have;
array x x2000-x2018;
call missing(sum, mean, n);
do _n_ = 1 to 10;
v = x ( start - 1999 -_n_ );
if not missing(v) then do;
sum + v;
n + 1;
end;
end;
if n then mean = sum / n;
run;
If you want to flex your SAS skill, you can use POKE and PEEK concepts to copy a fixed length slice (i.e. a fixed number of array elements) of an array to another array and compute the mean of the slice.
Example:
You will need to add sentinel elements and range checks on start to prevent errors when start-10 < 2000.
data have;
length id start x2000-x2018 8;
do id = 1 to 15;
start = 2013 + mod(id,6);
array x x2000-x2018;
do over x;
x = _n_;
_n_+1;
end;
output;
end;
format x: 5.;
run;
data want;
length id start mean10yrPriorStart 8;
set have;
array x x2000-x2018;
array slice(10) _temporary_;
call pokelong (
peekclong ( addrlong ( x(start-1999-10) ) , 10*8 ) ,
addrlong ( slice (1))
);
mean10yrPriorStart = mean(of slice(*));
run;

use an array and loop
index the array with years
accumulate the sum of the values
accumulate the count to account for any missing values
divide to obtain the mean value
data want;
set have;
array _pm(2000:2018) pm25_total2000 - pm25_total2018;
do year=reqyear to (reqyear-9) by -1;
*add totals;
total = sum(total, _pm(year));
*add counts;
nyears = sum(nyears,not missing(_pm(year)));
end;
*accounts for possible missing years;
mean = total/nyears;
run;
Note this loop goes in reverse (start year to 9 years previous) because it's slightly easier to understand this way IMO.
If you have no missing values you can remove the nyears step, but not a bad thing to include anyways.

NOTE: My first answer did not address the OP's question, so this a redux.
For this solution, I used Richard's code for generating test data. However, I added a line to randomly add missing values.
x = _n_;
if ranuni(1) < .1 then x = .;
_n_+1;
This alternative does not perform any checks for missing values. The sum() and n() functions inherently handle missing values appropriately. The loop over the dynamic slice of the data array only transfers the value to a temporary array. The final sum and count is performed on the temp array outside of the loop.
data want;
set have;
array x(2000:2018) x:;
array t(10) _temporary_;
j = 1;
do i = start-9 to start;
t(j) = x(i);
j + 1;
end;
sum = sum(of t(*));
cnt = n(of t(*));
mean = sum / cnt;
drop x: i j;
run;
Result:
id start sum cnt mean
1 2014 72 7 10.285714286
2 2015 305 10 30.5
3 2016 458 9 50.888888889
4 2017 631 9 70.111111111

Related

Calculate the top 5 and summarize them by store

Let's say I have stores all around the world and I want to know what was my top losses sales across the world per store. What is the code for that?!
here is my try:
proc sort data= store out=sorted_store;
by store descending amount;
run;
and
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then "Sum_5Largest_Losses"n=sum(amount);
end;
run;
but this just prints out the 5:th amount and not 1.. TO .. 5! and I really don't know how to select the top 5 of EACH store . I think a kind of group by would be a perfect fit. But first things, first. How do I selct i= 1...5 ? And not just = 5?
There is also way of doing it with proc sql:
data have;
input store$ amount;
datalines;
A 100
A 200
A 300
A 400
A 500
A 600
A 700
B 1000
B 1100
C 1200
C 1300
C 1400
D 600
D 700
E 1000
E 1100
F 1200
;
run;
proc sql outobs=4; /* limit to first 10 results */
select store, sum(amount) as TOTAL_AMT
from have
group by 1
order by 2 desc; /* order them to the TOP selection*/
quit;
The data step sum(,) function adds up its arguments. If you only give it one argument then there is nothing to actually sum so it just returns the input value.
data calc1;
do _n_=1 by 1 until(last.store);
set sorted_store;
by store;
if _n_ <= 5 then Sum_5Largest_Losses=sum(Sum_5Largest_Losses,amount);
end;
run;
I would highly recommend learning the basic methods before getting into DOW loops.
Add a counter so you can find the first 5 of each store
As the data step loops the sum accumulates
Output sum for counter=5
proc sort data= store out=sorted_store;
by store descending amount;
run;
data calc1;
set sorted_store;
by store;
*if first store then set counter to 1 and total sum to 0;
if first.store then do ;
counter=1;
total_sum=0;
end;
*otherwise increment the counter;
else counter+1;
*accumulate the sum if counter <= 5;
if counter <=5 then total_sum = sum(total_sum, amount);
*output only when on last/5th record for each store;
if counter=5 then output;
run;

Macro does not retain in a IF-THEN loop in DATA STEP

data output;
set input;
by id;
if first.id = 1 then do;
call symputx('i', 1); ------ also tried %let i = 1;
a_&i = a;
end;
else do;
call symputx('i', &i + 1); ------ also tried %let i = %sysevalf (&i + 1);
a_&i = a;
end;
run;
Example Data:
ID A
1 2
1 3
2 2
2 4
Want output:
ID A A_1 A_2
1 2 2 .
1 3 . 3
2 2 2 .
2 4 . 4
I know that you can do this using transpose, but i'm just curious why does this way not work. The macro does not retain its value for the next observation.
Thanks!
edit: Since %let is compile time, and call symput is execution time, %let will run only once and call symput will always be 1 step slow.
why does this way not work
The sequence of behavior in SAS executor is
resolve macro expressions
process steps
automatic compile of proc or data step (compile-time)
run the compilation (run-time)
a running data step can not modify its pdv layout (part of the compilation process) while it is running.
call symput() is performed at run-time, so any changes it makes will not and can not be applied to the source code as a_&i = a;
Array based transposition
You will need to determine the maximum number of items in the groups prior to coding the data step. Use array addressing to place the a value in the desired array slot:
* Hand coded transpose requires a scan over the data first;
* determine largest group size;
data _null_;
set have end=lastrecord_flag;
by id;
if first.id
then seq=1;
else seq+1;
retain maxseq 0;
if last.id then maxseq = max(seq,maxseq);
if lastrecord_flag then call symputx('maxseq', maxseq);
run;
* Use maxseq macro variable computed during scan to define array size;
data want (drop=seq);
set have;
by id;
array a_[&maxseq]; %* <--- set array size to max group size;
if first.id
then seq=1;
else seq+1;
a_[seq] = a; * 'triangular' transpose;
run;
Note: Your 'want' is a triangular reshaping of the data. To achieve a row per id reshaping the a_ elements would have to be cleared (call missing()) at first.id and output at last.id.

Function for conditioned mean in SAS

I have a dataset which some cells are valorized with 888888888 and 999999999. I would do the mean not considering these values. That is:
x=5, y=10, z=888888888
the mean will be 5.
How can I fix?
As you're calculating across variables, just store them in an array, loop through them and sum any that are less than the required threshold (I've used 100,000,000), then divide by the total number of variables to get the mean.
data have;
input x y z;
datalines;
5 10 888888888
4 20 999999999
;
run;
data want;
set have;
array vars{*} x y z;
_sum=0;
do _i = 1 to dim(vars);
if vars{_i}<1e8 then _sum+vars{_i};
end;
mean_vars = _sum/dim(vars);
drop _: ;
run;

Determine rates of change for different groups

I have a SAS issue that I know is probably fairly straightforward for SAS users who are familiar with array programming, but I am new to this aspect.
My dataset looks like this:
Data have;
Input group $ size price;
Datalines;
A 24 5
A 28 10
A 30 14
A 32 16
B 26 10
B 28 12
B 32 13
C 10 100
C 11 130
C 12 140
;
Run;
What I want to do is determine the rate at which price changes for the first two items in the family and apply that rate to every other member in the family.
So, I’ll end up with something that looks like this (for A only…):
Data want;
Input group $ size price newprice;
Datalines;
A 24 5 5
A 28 10 10
A 30 14 12.5
A 32 16 15
;
Run;
The technique you'll need to learn is either retain or diff/lag. Both methods would work here.
The following illustrates one way to solve this, but would need additional work by you to deal with things like size not changing (meaning a 0 denominator) and other potential exceptions.
Basically, we use retain to cause a value to persist across records, and use that in the calculations.
data want;
set have;
by group;
retain lastprice rateprice lastsize;
if first.group then do;
counter=0;
call missing(of lastprice rateprice lastsize); *clear these out;
end;
counter+1; *Increment the counter;
if counter=2 then do;
rateprice=(price-lastprice)/(size-lastsize); *Calculate the rate over 2;
end;
if counter le 2 then newprice=price; *For the first two just move price into newprice;
else if counter>2 then newprice=lastprice+(size-lastsize)*rateprice; *Else set it to the change;
output;
lastprice=newprice; *save the price and size in the retained vars;
lastsize=size;
run;
Here a different approach that is obviously longer than Joe's, but could be generalized to other similar situations where the calculation is different or depends on more values.
Add a sequence number to your data set:
data have2;
set have;
by group;
if first.group the seq = 0;
seq + 1;
run;
Use proc reg to calculate the intercept and slope for the first two rows of each group, outputting the estimates with outest:
proc reg data=have2 outest=est;
by group;
model price = size;
where seq le 2;
run;
Join the original table to the parameter estimates and calculate the predicted values:
proc sql;
create table want as
select
h.*,
e.intercept + h.size * e.size as newprice
from
have h
left join est e
on h.group = e.group
order by
group,
size
;
quit;

Shortcut code writing with loop

I am trying to create a loop that wiil shortcut code writing.
I want that every veriable from x1- x30 will be equal: x square i, when i is the index of x1(i.e 1).
For example x7 will be x7=x**7;
I wrote a code, but it doesn't work. and i don't know how to fix him. I will glad for your help people.
DATA maarah (drop = i e);
e = constant("e");
do i = -10 to 10 by 0.01;
x=i;
y=e**x;
output;
end;
length x1-x30 $2001;
do i =1 to 30 by 1;
x i=x**i;
output;
end;
run;
You're close. You need to declare an array. You don't explain what the first half is (the e**i part), so it's not clear what you want here - do you want a few thousand rows with powers of e, and then some rows with x1-x30? And why do you output each time in the second loop? To answer the core question, here:
DATA maarah (drop = i e);
e = constant("e");
do i = -10 to 10 by 0.01;
x=i;
y=e**x;
output;
end;
*length x1-x30 $2001; *what is this? Why do you want it 2001 characters, instead of numeric?;
array xs x1-x30; *you would need a $ after this if you truly wanted character;
do i =1 to 30 by 1;
xs[i]=x**i;
*output; *You probably do not want this. Output is probably outside of the loop.;
end;
run;
I would guess what you really want is this:
DATA maarah (drop = i e);
e = constant("e");
do i = -10 to 10 by 0.01;
x=i;
y=e**x;
*length x1-x30 $2001; *what is this? Why do you want it 2001 characters, instead of numeric?;
array xs x1-x30; *you would need a $ after this if you truly wanted character;
do j =1 to 30;
xs[j]=x**j;
end; *the x1-x30 loop;
output;
end; *the outer loop;
run;