I have two tables A and B that look like below.
Table A
rowno flag1 flag2 flag3
1 1 0 0
2 0 1 1
3 0 0 0
4 0 1 1
5 0 0 1
6 0 0 0
7 0 0 0
8 0 1 0
9 0 0 0
10 1 0 0
Table B
rowno flag1 flag2 flag3
Table A and B have the same column names but B is an empty table initially.
So what I want to accomplish is to insert the values from A to B row by row using macro, iteration by rowno. And each time I insert one row from A to B, I want to calculate the sum of each flag column.
If after insert each row, the sum(flag1) > 1 or sum(flag2) >1 or sum(flag3) >1, I need to delete that inserted row from table B. Then the iteration keeps running till the end of the observation in Table A. The final output in Table B is to have 5 observations from table A.
the code I have so far is below:
%macro iteration;
%do rowno=1 %to 10;
proc sql;
insert into table.B
select *
from table.A
where rowno = &rowno;
quit;
set table.B;
if
sum(flag1) > 1
or
sum(flag2) > 1
or
sum(flag3) > 1
then delete;
run;
%end;
%mend iteration;
%iteration
I received a lot of error messages.
Looking forward to your help and suggestions. Thanks.
The ideal output data would look like this
rowno flag1 flag2 flag3
1 1 0 0
2 0 1 1
3 0 0 0
6 0 0 0
7 0 0 0
Instead of a macro, use a running sum to calculate the running sum of each row. If you need to delete a row remember to reverse the increment to the running sum. Based on your data, I think Row 9 should also be kept.
data TableA;
input rowno flag1 flag2 flag3;
cards;
1 1 0 0
2 0 1 1
3 0 0 0
4 0 1 1
5 0 0 1
6 0 0 0
7 0 0 0
8 0 1 0
9 0 0 0
10 1 0 0
;
run;
data TableB;
set TableA;
retain sum_:;
*Increment running sum for flag;
sum_flag1+flag1;
sum_flag2+flag2;
sum_flag3+flag3;
*Check flag amounts;
if sum_flag1>1 or sum_flag2>1 or sum_flag3>1 then do;
*if flag is tripped then delete increment to flag and remove record;
sum_flag1 +-flag1;
sum_flag2 +-flag2;
sum_flag3 +-flag3;
delete;
end;
run;
Related
I'm quite new to SAS,
I have learned about SGplot, Datalines, IML and randgen.
I'd like to simply generate a random data for a simple scatter plot.
/* declaring manually a numeric list */
data my_data;
input x y ##;
datalines;
1 1 0 8 1 6 0 1 0 1 2 5
0 3 1 0 1 0 1 4 2 4 1 0
0 0 0 1 1 2 1 1 0 4 1 0
1 4 1 0 1 3 0 0 0 1 0 1
1 0 1 1 2 3 0 2 1 4 2 6
2 6 1 0 1 1 0 1 2 8 1 3
1 3 0 5 1 0 5 5 0 2 3 3
0 1 1 0 1 0 0 0 0 3
;
run;
proc sgplot data=my_data;
scatter x=x y=y;
run;
Now I would like in a similar manner to generate a vector of random numbers, such as:
proc iml;
N = 94;
rands = j(N,1);
call randgen(rands, 'Uniform'); /* SAS/IML 12.1 */
run;
and afterwards to transfer the vector as datalines and afterwards pass it into the SGplot.
Can somebody please demonstrate how to do it?
Since you want to pass it directly to datalines, use the submit and text substitution options in IML. This passes rands as an Nx1 vector into the datalines statement, allowing you to read it as one big line.
proc iml;
N = 94;
rands = j(N,1);
call randgen(rands, 'Uniform'); /* SAS/IML 12.1 */
submit rands;
data my_data;
input x y ##;
datalines;
&rands
;
run;
endsubmit;
quit;
proc sgplot data=my_data;
scatter x=x y=y;
run;
Note you'll need to double your size of N to get exactly 94, otherwise you will have 47. This is because it is reading each pair on the same line before moving to the next line. e.g.:
1 2 3 4
x = 1 y = 2
x = 3 y = 4
Source: Passing Values into Procedures (Rick Wicklin)
In SAS, I would like to create a label that check the previous sell indicator: if the sell indicator of the previous time period is 1/0 and in the current is 0/1 (meaning that it has changed) then I assign a value 1 to the ind variable.
The dataset looks like:
Customer Time Sell_Ind
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
And so on.
My expected output would be
Customer Time Sell_Ind Ind
1 2 1 0
1 3 0 1
1 4 0 0
2 23 0 0
2 24 0 0
2 30 0 0
5 12 1 0
5 11 0 1
The previous/current check is meant by customer.
I have tried as follows
data mydata;
set original;
By customer;
Lag_sell_ind=lag(sell_ind);
If first.customer then Lag_sell_ind=.;
Run;
But it does not return the expected output.
In sql I would probably use partition by customer over time but I do not know how to do the same in SAS.
You were halfway through, you only need to add one if statement to achieve the desired output.
data want;
set have;
by customer;
lag=lag(sell_ind);
if first.customer then lag=.;
if sell_ind ne lag and lag ne . then ind = 1;
else ind = 0;
drop lag;
run;
You can simplify this using the IFN Function like below.
data have;
input Customer Time Sell_Ind;
datalines;
1 2 1
1 3 0
1 4 0
2 23 0
2 24 0
2 30 0
5 12 1
5 11 0
;
data want;
set have;
by customer;
Lag_sell_ind = ifn(first.customer, 0, lag(sell_ind));
Run;
How to recognize the first "1,0" sequence in column "Flag" from each group and mark a "1" just like it in column "Flag2"?
ID Flag Flag2
1 1
1 1 1
1 0
1 1
1 0
1 0
2 1
2 1
2 1
2 1 1
2 0
2 0
3 0
3 0
3 0
3 0
4 1
4 1 1
4 0
4 1
The problem requires using a 'lead' concept (value from next row) similar to the lag concept provided by the lag function. There is no built in lead function so you need to be creative.
Merge the data to itself, without a by statement, where the second version is:
Offset by one row by the firstobs data set option
Renames the variables so the lead state can be established with an if
A retained variable tracks if the 1,0 transition has been observed within the group.
Sample code:
data have;input
ID Flag; datalines;
1 1
1 1
1 0
1 1
1 0
1 0
2 1
2 1
2 1
2 1
2 0
2 0
3 0
3 0
3 0
3 0
4 1
4 1
4 0
4 1
run;
data want;
merge
have
have(firstobs=2 rename=(id=lead_id flag=lead_flag))
;
retain flagged_id;
if (id=lead_id) /* lead is in same group */
and (flag=1) and (lead_flag=0) /* transition identified */
and (flagged_id ne id) then /* first such transition for group */
do;
flag2=1; /* flag the lead transition */
flagged_id = id; /* track id where transition last flagged */
end;
drop lead_: flagged:;
run;
My dataset is like this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
what I want is this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 23 4 22 0 14 31 15
1 23 4 22 0 14 31 15
2 22 4 22 0 13 31 15
3 19 4 19 0 12 25 12
4 19 4 19 0 12 25 12
5 19 0 19 0 12 25 12
6 15 0 15 0 8 17 11
7 7 0 7 0 0 7 3
8 7 0 7 0 0 7 3
where the sum is the value for bucket 0 and 1 row the corresponding bucket 2 for column D_201009 =sum-original value(1) and later for bucket 3 for column D_201009 previous value(lag value) -3(value original) and label this column as original column name. I wrote the code to perform one column.
data test;
input bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103;
datalines;
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
;
run;
Saving these column names in a macro
proc contents data = test
out = vars(keep = varnum name)
noprint;
run;
proc sql noprint;
select distinct name
into :orderedvars2 separated by ' '
from vars
where varnum >=2
order by varnum;
quit;
Finding sum of one column only
proc sql;
select sum(D_201009) into :total from test;
quit;
Using lag to perform
data result(drop= D_201009 lag_D_201009 rename=(sum=D_201009));
set test;
retain sum;
if bucket < 2 then sum = &total;
sum = sum(sum, -lag(D_201009));
run;
how do I change the code to work for all columns where the column names are stored as macro &orderedvars2. ?
The way I'd approach it would be to transpose the data structure to a more useful data structure; then you don't have to use macro variables. You can use BY processing instead, and no lags.
The way I create the final output is to transpose the initial dataset so you have one row per bucket/D_var, then sort by the D_vars (_NAME_ holds that). Then use a Double DoW loop in order to first calculate the sum, and then to subtract the value. Note I don't have to use Retain or Lag here, I can just directly operate on the value since I'm in a DoW loop. I output before subtracting since that's what you seem to want. Then I retranspose back.
This might not be the fastest option if you have very large data, since it goes through several steps; if you do, you should be using a more efficient algorithm anyway. But it's likely the least fiddly if you don't always have the same columns.
proc transpose data=test out=test_t;
by bucket;
run;
proc sort data=test_t;
by _name_ bucket;
run;
data want_t;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
sum_var = sum(sum_var,col1);
end;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
output;
sum_var = sum_var - col1;
end;
run;
proc sort data=want_t;
by bucket _name_;
run;
proc transpose data=want_t out=want;
by bucket;
id _name_;
var sum_var;
run;
Use proc summary to get sum of each variable, then define multiple arrays.
proc summary data=test;
var D:;
output out=sum(drop=_:) sum=/autoname;
run;
data want;
set test;
if _n_=1 then set sum;
array var1 D_201009--D_201103;
array var2 D_201009_sum--D_201103_sum;
array var3 _D_201009 _D_201010 _D_201011 _D_201012 _D_201101 _D_201102 _D_201103;
array temp (7) _temporary_;
retain temp;
do i=1 to dim(var1);
lag=lag(var1(i));
if bucket<2 then var3(i)=var2(i);
else var3(i)=sum(temp(i),-lag);
temp(i)=var3(i);
end;
drop D: lag i;
run;
If I understand this right you want to sum the column and then subtract the value of each observation from the total?
Getting totals is easy, just use proc summary.
Then combine it with the original data. Here is a way that will work without having to worry about the actual variable names. In this program it will sum all variables that start with d_ but you could use any variable list you want. If you have more than 100 variables then change the dimension of the temporary array.
%let varlist=d_:;
* Get sums into variables with same names ;
proc summary data=have ;
var &varlist ;
output out=total sum= ;
run;
data want ;
set have(obs=0) /* Set variable order */
total(keep=&varlist) /* Get totals */
have(keep=&varlist) /* Get lagged variables */
;
array vars &varlist ;
array total (100) _temporary_;
set have (drop=&varlist); /* Get non-lagged variables */
do i=1 to dim(vars);
if _n_>1 then vars(i)=total(i)-vars(i);
total(i)=vars(i);
end;
drop i;
run;
If you have missing values you might want to add this line of code at beginning of the DO loop:
vars(i)=coalesce(vars(i),0);
I have a data set which essence is the following
data have;
input Name $ ab gh vz iz jh pq ch km eo lk;
datalines;
adam 7 8 7 0 0 0 0 0 0 0
bob 0 1 0 3 4 6 0 1 6 0
clint 0 0 0 5 4 3 1 0 0 2
;
run;
Now I would like to count how many times I have a number greater than zero in the variables iz, jh, chand km. The result should look like this
/* want
Name ab gh vz iz jh pq ch km eo lk count_of_iz_jh_ch_km
adam 7 8 7 0 2 3 0 0 0 0 1
bob 0 1 0 3 0 6 0 1 6 0 2
clint 5 0 0 5 4 3 1 2 0 2 4
*/
I would greatly appreciate any help since I wasn't successful searching the internet for a solution.
Gerit
The below code will initialize the required variables from have into an array called vars, then for each row, count every time one of these variables is > 0.
data want;
set have;
array vars[*] iz jh ch km;
count_of_iz_ch_km = 0;
do i = 1 to dim(vars);
if(vars[i] > 0) then count_of_iz_ch_km+1;
end;
drop i;
run;