I am trying to sum one variable as long as another remains constant. I want a cumulative sum of dur for as long as a stays constant. When a changes, the sum restarts. When a new id begins, the sum restarts as well.
(Input data: image not included)
I would like to produce this:
(Desired output: image not included)
Thanks
You can use a BY statement to specify the variables whose value combinations organize the data rows into groups. Reset an accumulator at the start of each group and add to it on every row of the group. Use RETAIN to keep a new variable's value across iterations of the DATA step's implicit loop. The SUM statement, a SAS-specific feature, both accumulates and retains.
Example:
data want;
set have;
by id a;
if first.a then mysum = 0;
mysum + dur;
run;
The SUM statement is not the same thing as the SUM function.
<variable> + <expression>; * SUM statement, unique to SAS (not found in other languages);
can be thought of as
retain <variable>;
<variable> = sum (<variable>, <expression>);
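As a small worked illustration of the BY-group reset described above, here is a sketch on made-up data (the values in have are hypothetical, not the asker's). It also adds the NOTSORTED option, which is an assumption about the input: with it, each run of equal a values is its own group, so the sum restarts even when a returns to an earlier value within the same id.

```sas
/* Hypothetical sample data, for illustration only */
data have;
  input id a dur;
  datalines;
1 1 10
1 1 5
1 2 3
1 1 4
2 1 7
2 1 1
;
run;

data want;
  set have;
  by id a notsorted;          /* groups are runs of equal values, no sort required */
  if first.a then mysum = 0;  /* first.a is also set whenever id changes */
  mysum + dur;                /* SUM statement: retained, missing dur adds nothing */
run;
/* mysum by row: 10, 15, 3, 4, 7, 8 */
```

Without NOTSORTED the same step works as long as the data are already sorted by id and a.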
As far as I can tell, you need to self-join your table on a ranked column.
The rank should be computed within each combination of the id and a columns.
WORK.QUERY_FOR_STCKOVRFLW is the table you provided in the screenshot.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_STCKOVRFLW_0001 AS
SELECT t1.id,
t1.a,
t1.dur,
/* mono */
(monotonic()) AS mono
FROM WORK.QUERY_FOR_STCKOVRFLW t1;
QUIT;
PROC SORT
DATA=WORK.QUERY_FOR_STCKOVRFLW_0001
OUT=WORK.SORTTempTableSorted
;
BY id a;
RUN;
PROC RANK DATA = WORK.SORTTempTableSorted
TIES=MEAN
OUT=WORK.RANKRanked(LABEL="Rank Analysis for WORK.QUERY_FOR_STCKOVRFLW_0001");
BY id a;
VAR mono;
RANKS rank_mono ;
RUN; QUIT;
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_RANKRANKED AS
SELECT t1.id,
t1.a,
t1.dur,
/* SUM_of_dur */
(SUM(t2.dur)) FORMAT=BEST12. AS SUM_of_dur
FROM WORK.RANKRANKED t1
LEFT JOIN WORK.RANKRANKED t2 ON (t1.id = t2.id) AND (t1.a = t2.a AND (t1.rank_mono >= t2.rank_mono ))
GROUP BY t1.id,
t1.a,
t1.rank_mono, /* keeps rows that share the same dur within a group separate */
t1.dur;
QUIT;
Related
I am trying to calculate some statistics for a given variable based on the client id and the time horizon. My current solution is shown below. However, I would like to know if there is a way to rewrite the code as a data step instead of an SQL join, because the join takes a very long time to execute on my real dataset.
data have1(drop=t);
id = 1;
dt = '31dec2020'd;
do t=1 to 10;
dt = dt + 1;
var = rand('uniform');
output;
end;
format dt ddmmyyp10.;
run;
data have2(drop=t);
id = 2;
dt = '31dec2020'd;
do t=1 to 10;
dt = dt + 1;
var = rand('uniform');
output;
end;
format dt ddmmyyp10.;
run;
data have_fin;
set have1 have2;
run;
Proc sql;
create table want1 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-3,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Proc sql;
create table want2 as
select a.id, a.dt,a.var, mean(b.var) as mean_var_3d
from have_fin as a
left join have_fin as b
on a.id = b.id and intnx('day',a.dt,-6,'S') < b.dt <= a.dt
group by 1,2,3;
Quit;
Use temporary arrays and a single data step instead. The steps are:
1. Sort the data to ensure the order is correct.
2. Declare a temporary array for each moving average you want to calculate.
3. Ensure the arrays are empty at the start of each ID.
4. Assign each value to the correct array index. MOD() lets you index the array dynamically without having to keep a separate counter variable.
5. Take the average of the array. If you want to skip the first two rows of each ID, where the window holds only one or two data points, you can calculate the average conditionally.
*sort to ensure data is in correct order (Step 1);
proc sort data=have_fin;
by id dt;
run;
data want;
*Step 2;
array p3{0:2} _temporary_;
array p6(0:5) _temporary_;
set have_fin;
by ID;
*clear values at the start of each ID for the array Step3;
if first.ID then call missing(of p3{*}, of p6(*));
*assign the value to the array, the mod function indexes the array so it's continuously the most recent 3/6 values;
*Step 4;
p3{mod(_n_,3)} = var;
p6{mod(_n_,6)} = var;
*Step 5 - calculates statistic of interest, average in this case;
mean3d = mean(of p3(*));
mean6d = mean(of p6(*));
run;
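The conditional calculation mentioned in the last step could look like the sketch below. It uses the N function to count the non-missing values in the window and only reports the mean once three values are present. The dataset demo and its values are hypothetical, and the rule "require a full window" is an assumption about what the asker wants.

```sas
/* Hypothetical sample data, already sorted by id and dt */
data demo;
  input id dt :date9. var;
  format dt date9.;
  datalines;
1 01jan2021 1
1 02jan2021 2
1 03jan2021 3
1 04jan2021 4
2 01jan2021 10
2 02jan2021 20
;
run;

data want_full;
  array p3{0:2} _temporary_;
  set demo;
  by id;
  if first.id then call missing(of p3{*});   /* empty the window for each new id */
  p3{mod(_n_, 3)} = var;                     /* overwrite the oldest slot */
  /* mean3d stays missing until the window holds 3 non-missing values */
  if n(of p3{*}) = 3 then mean3d = mean(of p3{*});
run;
/* mean3d by row: ., ., 2, 3, ., . */
```

Note that a missing var also lowers the count, so rows whose window contains a missing value get a missing mean under this rule.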
And if you have SAS/ETS licensed this is super trivial.
*prints product to log - check if you have SAS/ETS licensed;
proc product_status;run;
*sorts data;
proc sort data=have_fin;
by id dt;
run;
*calculates moving average;
proc expand data=have_fin out=want_expand;
by ID;
id dt;
convert var = mean_3d / method=none transformout= (movave 3);
convert var = mean_6d / method=none transformout= (movave 6);
run;
proc sql;
create table abc as select distinct formatted_date ,Contract, late_days
from merged_dpd_raw_2602
group by 1,2
;quit;
This gives me the 3 variables I'm working with.
They have the form
|ID|Date in YYMMs.10| number|
proc sql;
create table max_dpd_per_contract as select distinct contract, max(late_days) as DPD_for_contract
from sasa
group by 1
;quit;
this gives me the maximum number for the entire period but how do I go on to make it per period?
I'm guessing the timeseries procedure should be used here.
proc timeseries data=sasa
out=sasa2;
by contract;
id formatted_date interval=day ACCUMULATE=maximum ;
trend maximum ;
var late_days;
run;
but I am unsure how to continue.
I want to find the maximum value of the variable "late_days" per given time period (month). So for contract A for the time period jan2018 the max late_days value is X.
How the data looks: https://imgur.com/iIufDAx
In SQL you will want to calculate your aggregate within a group that uses a computed month value.
Example:
data have;
call streaminit(2021);
length contract date days_late 8;
do contract = 1 to 10;
days_late = 0;
do date = '01jan2020'd to '31dec2020'd;
if days_late then
if rand('uniform') < .55 then
days_late + 1;
else
days_late = 0;
else
days_late + rand('uniform') < 0.25;
output;
end;
end;
format date date9.;
run;
options fmterr;
proc sql;
create table want as
select
contract
, intnx('month', date, 0) as month format = monyy7.
, max(days_late) as max_days_late
from
have
group by
contract, month
;
quit;
You will get the same results using Proc MEANS
proc means nway data=have noprint;
class contract date;
format date monyy7.;
output out=want_2 max(days_late) = max_days_late;
run;
I have a dataset named censusdata with 11346 observations; some of the last observations contain blank data. I need to find the total of the population variable, which is named t_p.
I am using this code:
data q1(keep=t_p count);
set censusdata;
array num(*) t_p;
retain count;
do i=1 to dim(num);
if t_p = i then count=t_p;
else count+t_p;
end;
run;
The problem is that SAS sums the first 3236 observations, then starts a new sum for observations 3237 to 4683, and so on. It does not produce the sum over all observations that I need.
I need the sum of the total population (t_p), and I need an output dataset like this:
totalpopulation=number
Sum the variable in a proc sql step:
proc sql;
create table q1 as select
sum(t_p) as total_pop
from censusdata;
quit;
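Since the question attempted this in a data step, here is a sketch of a data step equivalent, using the SUM statement and the END= option to write a single row at the end. The dataset censusdata_demo and its values are hypothetical stand-ins for censusdata; the blank last row mimics the blank observations described in the question.

```sas
/* Hypothetical stand-in for censusdata; the last row has a blank t_p */
data censusdata_demo;
  input t_p;
  datalines;
100
250
.
;
run;

data q1(keep=total_pop);
  set censusdata_demo end=last;
  total_pop + t_p;        /* SUM statement: retained, starts at 0, skips missing t_p */
  if last then output;    /* write one row holding the grand total */
run;
/* q1 has one row: total_pop = 350 */
```

No array or loop is needed, because the data step already iterates over the observations one at a time.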
I have table1 that contains one column (city), I have a second table (table2) that has two columns (city, distance),
I am trying to create a third table, table 3, this table contains two columns (city, distance), the city in table 3 will come from the city column in table1 and the distance will be the corresponding distance in table2.
I tried doing this using Proc IML based on Joe's suggestion and this is what I have.
proc iml;
use Table1;
read all var _CHAR_ into Var2 ;
use Table2;
read all var _NUM_ into Var4;
read all var _CHAR_ into Var5;
do i=1 to nrow(Var2);
do j=1 to nrow(Var5);
if Var2[i,1] = Var5[j,1] then
x[i] = Var4[i];
end;
create Table3 from x;
append from x;
close Table3 ;
quit;
I am getting an error, matrix x has not been set to a value. Can somebody please help me here. Thanks in advance.
The technique you want to use is called the "unique-loc technique". It enables you to loop over unique values of a categorical variable (in this case, unique cities) and do something for each value (in this case, copy the distance into another array).
So that others can reproduce the idea, I've embedded the data directly into the program:
proc iml;
T1_City = {"Gould","Boise City","Felt","Gould","Gould"};
T2_City = {"Gould","Boise City","Felt"};
T2_Dist = {10, 15, 12};
T1_Dist = j(nrow(T1_City),1,.); /* allocate vector for results */
do i = 1 to nrow(T2_City);
idx = loc(T1_City = T2_City[i]); /* rows of Table1 that match this city */
if ncol(idx)>0 then
T1_Dist[idx] = T2_Dist[i];
end;
print T1_City T1_Dist;
quit;
The IF-THEN statement guards against cities in Table2 that are not in Table1; for those, LOC returns an empty matrix and the assignment must be skipped. You can read about why it is important to use that IF-THEN statement. The IF-THEN statement is not needed if every city in Table2 also appears in Table1.
This technique is discussed and used extensively in my book Statistical Programming with SAS/IML Software.
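To connect the technique back to the tables in the question, the sketch below reads the columns from data sets and writes the result to Table3. The sample rows are hypothetical (single-word city names, with "Kenton" made up to show an unmatched city), and the column names city and distance are taken from the question's description.

```sas
/* Hypothetical sample tables */
data Table1;
  length city $20;
  input city $;
  datalines;
Gould
Felt
Gould
Kenton
;
run;

data Table2;
  length city $20;
  input city $ distance;
  datalines;
Gould 10
Felt 12
;
run;

proc iml;
use Table1;  read all var {city} into T1_City;      close Table1;
use Table2;  read all var {city} into T2_City;
             read all var {distance} into T2_Dist;  close Table2;

T1_Dist = j(nrow(T1_City), 1, .);     /* missing until a match is found */
do i = 1 to nrow(T2_City);
   idx = loc(T1_City = T2_City[i]);
   if ncol(idx) > 0 then T1_Dist[idx] = T2_Dist[i];
end;

City = T1_City;  Distance = T1_Dist;  /* names become the output columns */
create Table3 var {"City" "Distance"};
append;
close Table3;
quit;
```

Cities in Table1 with no match in Table2 keep a missing distance.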
You need a nested loop, or to use a function that finds a value in another matrix.
For example:
do i = 1 to nrow(table1);
do j = 1 to nrow(table2);
...
end;
end;
I am trying to convert a categorical variable (Product) to binary indicators and then want to know how many products each customer has.
data is in the following format:
ID Product
C1 A
C1 B
C2 A
C3 B
C4 A
The code I am using for converting category to binary
IF PRODUCT="A" THEN PROD_A =1 ; ELSE PROD_A=0;
IF PRODUCT="B" THEN PROD_B =1 ; ELSE PROD_B=0;
TOT_PROD = SUM(PROD_A, PROD_B);
But when I count the number of products, it gives me '1' for every customer, and I am expecting 1 or 2.
I have tried
TOT_PROD = PROD_A + PROD_B;
but I get the same results
This is all inside one datastep, correct? If so you're processing only one line at a time. For each individual line the only possible values for PROD_A and PROD_B are one or zero. You need an aggregate function. For example, if your dataset is named PRODUCTS:
DATA X;
SET PRODUCTS;
IF PRODUCT="A" THEN PROD_A = 1 ; ELSE PROD_A=0;
IF PRODUCT="B" THEN PROD_B = 1 ; ELSE PROD_B=0;
TOT_PROD = SUM(PROD_A, PROD_B);
RUN;
(TOT_PROD will always be equal to 1 in X, but never mind for now).
Now sum them up:
proc sql;
create table prod_totals as
select id, sum(tot_prod) as total_products
from x
group by id; /* one row per customer */
quit;
More simply just skip the data step:
proc sql;
create table prod_totals as
select id, count(*) as total_products
from products
group by id; /* one row per customer */
quit;
Or use PROC SUMMARY or PROC MEANS instead of PROC SQL.
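If the binary flags are also wanted in the result, a single PROC SQL step can produce them per customer. This is only a sketch: it relies on SAS evaluating a comparison such as product = 'A' to 1 or 0, and it uses the sample rows from the question as the products table.

```sas
/* Sample rows from the question */
data products;
  input id $ product $;
  datalines;
C1 A
C1 B
C2 A
C3 B
C4 A
;
run;

proc sql;
  create table prod_flags as
  select id,
         max(product = 'A') as prod_a,             /* 1 if the customer has product A */
         max(product = 'B') as prod_b,             /* 1 if the customer has product B */
         count(distinct product) as total_products /* repeated rows count once */
  from products
  group by id;
quit;
/* C1: 1 1 2   C2: 1 0 1   C3: 0 1 1   C4: 1 0 1 */
```

Using count(distinct product) rather than count(*) is a design choice here, so that a customer listed twice with the same product is still counted once.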
I have assumed you only want 1 record output per id.
In the solutions below I have employed the DOW-Loop (DO-Whitlock).
If you wanted prod_a and prod_b just to help with the totals and if they're not required in the output, then you could use something like:
data want;
do until(last.id);
set have;
by id;
tot_prod=sum(tot_prod,product='A',product='B');
end;
run;
If you need prod_a and prod_b in the output, then you could use:
data want;
do until(last.id);
set have;
by id;
prod_a=(product='A');
prod_b=(product='B');
tot_prod=sum(tot_prod,prod_a,prod_b);
end;
run;
In both data steps the last product per id will be output along with the other variables and in the case of the 2nd data step example the last prod_a & prod_b per id will also be output.
To do this in the data step, you need retain. Make sure you've sorted the dataset by id first.
data prod_totals;
set products;
by ID;
retain prod_a prod_b;
if first.id then do; *initialize to zero for each new ID;
prod_a=0; prod_b=0;
end;
if product='A' then prod_a=1; *set to 1 for each one found;
else if product='B' then prod_b=1;
if last.id then do; *for last record in each ID, output and sum total;
total_products=sum(prod_a,prod_b);
output;
end;
keep id prod_a prod_b total_products;
run;