Moving next observation to previous column to make a triangle in SAS

I was able to populate the input and now want to convert the input triangle below to a right-angled triangle, as shown in the output.
INPUT (a period marks an empty cell)
asur C0012 C0112 C0212 C0312 C0412 C0512
2000 5133049 2629201 3145968 3710712 4023650 4090428
2001 . 1413328 2535620 2348286 3357177 3389958
2002 . . 1594953 2663058 3003008 3139910
2003 . . . 1694882 3616471 4201837
2004 . . . . 1861858 3567559
2005 . . . . . 17853454
2006
2007
OUTPUT
asur C0012 C0112 C0212 C0312 C0412 C0512
2000 5133049 2629201 3145968 3710712 4023650 4090428
2001 1413328 2535620 2348286 3357177 3389958
2002 1594953 2663058 3003008 3139910
2003 1694882 3616471 4201837
2004 1861858 3567559
2005 1785345
2006
2007

You need to show what was tried, otherwise we are just doing your homework.
data have;
input
asur C0012 C0112 C0212 C0312 C0412 C0512;
datalines;
2000 5133049 2629201 3145968 3710712 4023650 4090428
2001 . 1413328 2535620 2348286 3357177 3389958
2002 . . 1594953 2663058 3003008 3139910
2003 . . . 1694882 3616471 4201837
2004 . . . . 1861858 3567559
2005 . . . . . 17853454
run;
Use an ARRAY statement to allow indexed address processing of the variables. i and j track which slots the values move from and to.
data want;
set have;
array C C0012 C0112 C0212 C0312 C0412 C0512;
j = 0;
if _n_ > 1 then
do i = _n_ to dim(c); * source slot: start on the diagonal;
j = j + 1; * target slot: iteration #;
C[j] = C[i]; * shift data leftward;
C[i] = . ; * clear slot data was in;
end;
drop i j ;
run;
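The loop above relies on _n_ matching each row's position on the diagonal. As a hedged variation, the sketch below counts the leading missing values in each row and shifts by that amount, so it should not depend on row order. It reuses the have data set and variable names from the answer; the helper variable shift is an assumption added for illustration.

```sas
data have;
  input asur C0012 C0112 C0212 C0312 C0412 C0512;
  datalines;
2000 5133049 2629201 3145968 3710712 4023650 4090428
2001 . 1413328 2535620 2348286 3357177 3389958
2002 . . 1594953 2663058 3003008 3139910
2003 . . . 1694882 3616471 4201837
2004 . . . . 1861858 3567559
2005 . . . . . 17853454
;

data want;
  set have;
  array C C0012 C0112 C0212 C0312 C0412 C0512;
  /* find the first non-missing slot; i - 1 is the number of leading missings */
  do i = 1 to dim(C) while (missing(C[i]));
  end;
  shift = i - 1;
  /* shift left by that amount and blank out the vacated slots */
  if 0 < shift < dim(C) then do i = 1 to dim(C);
    if i + shift <= dim(C) then C[i] = C[i + shift];
    else C[i] = .;
  end;
  drop i shift;
run;
```

Rows with no leading missing values, or with nothing but missing values, are left unchanged.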

Related

The DROP statement doesn't work with variables created with macro arrays (SAS)

I am trying to run the following code, that calculates the slope of each variables' timeseries.
I need to drop the variables created with an array, because I use the same logic for other functions.
Nevertheless, the output data set keeps the variables ys_new&i._: and I get the warning: The variable 'ys_new3_:'n in the DROP, KEEP, or RENAME list has never been referenced.
I think the iterator has already been incremented to 3 by the time the %do %while block ends.
If someone can help me, I would really appreciate it.
DATA HAVE;
INPUT ID N_TRX_M0-N_TRX_M12 TRANSACTION_AMT_M0-TRANSACTION_AMT_M12;
DATALINES;
1 3 6 3 3 7 8 6 10 5 5 8 7 7 379866 856839 307909 239980 767545 511806 603781 948936 566114 402214 844657 2197164 817390
2 51 56 55 73 48 57 54 53 55 52 49 72 53 6439314 7367157 4614827 9465017 3776064 3661525 7870605 3971889 4919128 10024385 4660264 7748467 7339863
3 5 . . . . . . . . . . . . 232165 . . . . . . . . . . . .
;
RUN;
%Macro slope(variables)/parmbuff;
%let i = 1;
/* Get the first Parameter */
%let parm_&i = %scan(&syspbuff,&i,%str( %(, %)));
%do %while (%str(&&parm_&i.) ne %str());
array ys&i(12) &&parm_&i.._M12 - &&parm_&i.._M1;
array ys_new&i._[12];
/* Shift the values forward to skip the missing ones */
k = 1;
do j = 1 to 12;
if not(missing(ys&i(j))) then do;
ys_new&i._[k] = ys&i[j];
k + 1;
end;
end;
nonmissing = n(of ys_new&i._{*});
xbar = (nonmissing + 1)/2;
if nonmissing ge 2 then do;
ybar = mean(of ys&i(*));
cov = 0;
varx = 0;
do m=1 to nonmissing;
cov=sum(cov, (m-xbar)*(ys_new&i._(m)-ybar));
varx=sum(varx, (m-xbar)**2);
end;
slope_&&parm_&i. = cov/varx;
end;
%let i = %eval(&i+1);
/* Get next parm */
%let parm_&i = %scan(&syspbuff ,&i, %str( %(, %)));
%end;
drop ys_new&i._: k j m nonmissing ybar xbar cov varx;
%mend;
%let var_slope =
N_TRX,
TRANSACTION_AMT
;
DATA FEATURES;
SET HAVE;
%slope(&var_slope)
RUN;
The simplest solution is to generate the DROP statement before the macro has a chance to change the value of the macro variable I.
array ys&i(12) &&parm_&i.._M12 - &&parm_&i.._M1;
array ys_new&i._[12];
drop ys_new&i._: k j m nonmissing ybar xbar cov varx;
You could use a _TEMPORARY_ array instead, but then you need to remember to clear the values on each iteration of the data step.
array ys_new&i._[12] _temporary_;
call missing(of ys_new&i._[*]);
Then you can leave the DROP statement at the end if you want.
drop k j m nonmissing ybar xbar cov varx;
You are correct. &i is 3 after it exits the do loop leading to the drop statement giving a warning that ys_new3_: does not exist. Instead, consider using a temporary array to avoid the drop statement altogether:
array ys_new&i._[12] _TEMPORARY_;
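To see the timing issue in isolation, here is a minimal sketch of the same %do %while pattern with nothing but %put statements. The macro name show_counter and the macro variable item are hypothetical and used only for this demonstration; the list is space-delimited for simplicity rather than parsed from &syspbuff.

```sas
%macro show_counter(list);
  %local i item;
  %let i = 1;
  %let item = %scan(&list, &i, %str( ));
  %do %while (%length(&item) > 0);
    /* inside the loop &i still points at the current item */
    %put NOTE: inside loop i=&i item=&item;
    %let i = %eval(&i + 1);
    %let item = %scan(&list, &i, %str( ));
  %end;
  /* after the loop &i is one past the last item */
  %put NOTE: after loop i=&i;
%mend show_counter;

%show_counter(N_TRX TRANSACTION_AMT)
```

With two items the last note reports i=3, which is why a DROP statement placed after %end refers to ys_new3_: instead of ys_new1_: and ys_new2_:.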

SAS for loop questions

Players numbered 1 to 50 are placed in a row in order. The coach says: "Odd-numbered athletes out!" The remaining athletes re-queue and are renumbered. The coach orders again: "Odd-numbered athletes out!" This continues until only one person is left. What was that athlete's original number? What if the coach keeps ordering "Even-numbered athletes out!" instead? Who is left at the end?
I know I need to use a loop in SAS to answer the question, but I can only write the code below:
data a;
do i=1 to 50;
output;
end;
run;
proc sql;
select i
from a
where mod(i,2**5)=0;
quit;
But it won't work for keeping the last odd-numbered athlete. Could you figure out a way to simulate this process using a loop? Thanks so much.
@Doris welcome :-)
Try this. The Final_Player data set contains the number of the final player in the simulation.
As written, the code removes the players in odd positions (mod(_N_, 2) = 1). Simply change the condition to = 0 for the even problem. Feel free to ask.
data _null_;
dcl hash h(ordered : 'y');
h.definekey('p');
h.definedone();
dcl hiter ih('h');
dcl hash i(ordered : 'Y');
i.definekey('id');
i.definedone();
dcl hiter ii('i');
do p = 1 to 50;
h.add();
end;
id = .;
do while (h.num_items > 1);
do _N_ = 1 by 1 while (ih.next() = 0);
if mod(_N_, 2) = 1 then do;
i.add(key : p, data : p);
end;
end;
do while (ii.next() = 0);
rc = h.remove(key : id);
end;
i.clear();
end;
h.output(dataset : 'Final_Player');
run;
Just use algebra.
want = 2 ** floor( log2(n) );
So if you are starting with an arbitrary dataset you can find the one observation you need directly.
data want;
  point = 2**floor(log2(nobs));   * NOBS= is available before SET executes;
  set a point=point nobs=nobs;
  put i= ;                        * must come before STOP to be executed;
  output;
  stop;
run;
Here is an example using an array that shows how it works.
data test;
  array x [15];
  do index=1 to dim(x); x[index]=index; end;
  do iteration=1 by 1 while(n(of x[*])>1);
    do index= 2**(iteration-1) to dim(x) by 2**iteration ;
      x[index]=.;
    end;
    put iteration= (x[*]) (3.);
  end;
  do index=1 to dim(x) until(x[index] ne .);
  end;
  put index= x[index]= ;
run;
Log:
iteration=1 . 2 . 4 . 6 . 8 . 10 . 12 . 14 .
iteration=2 . . . 4 . . . 8 . . . 12 . . .
iteration=3 . . . . . . . 8 . . . . . . .
index=8 x8=8
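The elimination rounds described in the question can also be simulated literally, by re-queuing the survivors after every order. The sketch below is one possible way to do that, not the method of any answer above; the names q, remove_parity, kept and last_player are assumptions introduced for the example. It runs both variants in one step.

```sas
data _null_;
  array q[50] _temporary_;              /* q[p] = original number of the player at position p */
  do remove_parity = 1, 0;              /* 1 = odd positions out, 0 = even positions out */
    n = dim(q);
    do p = 1 to n;
      q[p] = p;
    end;
    do while (n > 1);
      kept = 0;
      do p = 1 to n;
        /* keep players whose current position has the other parity */
        if mod(p, 2) ne remove_parity then do;
          kept = kept + 1;
          q[kept] = q[p];               /* re-queue: survivors close ranks */
        end;
      end;
      n = kept;
    end;
    last_player = q[1];
    put remove_parity= last_player=;
  end;
run;
```

For 50 players the odd variant leaves player 32, in line with 2**floor(log2(50)), and the even variant leaves player 1, because position 1 is never removed.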

SAS Comparing values across multiple columns of the same observation?

I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct within each row, I can do it among multiple rows but within the row by the columns, I don't know.
Any help would be appreciated. Thank you
********************************************UPDATE*******************************************************
Thank you to everyone that has replied!!
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering:
I created four tables where in each table I created a variable named for example ‘abcd’ and placed a variable under that name.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*,
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*,
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*,
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*,
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows, but that's OK because I just want all of the variables in one column so that I can do a distinct count).
data ALL_STACK;
set
table1_a
table2_b
table3_c
table4_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
QUIT;
Obviously, it's not efficient to replicate a table four times just to put a variable under the same name and then stack the copies. But my tables were small enough that I could do it and then delete them immediately after the stack. With a very large dataset this method would almost certainly be troublesome. I chose it over the others because I was trying to use procs rather than loops.
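The stacking idea described above can also be written as a single query, which avoids creating the four intermediate tables. This is only a sketch: the table name have and the column names id, a, b, c, d are assumptions borrowed from the sample data used in the answers, not the original table.

```sas
data have;
  input id a b c d;
  datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;

proc sql;
  create table count_unique as
  select id, count(distinct abcd) as count_distinct
  /* stack the four columns into one column named abcd */
  from (select id, a as abcd from have
        union all
        select id, b as abcd from have
        union all
        select id, c as abcd from have
        union all
        select id, d as abcd from have)
  group by id;
quit;
```

COUNT(DISTINCT ...) ignores missing values, so a row made up entirely of missing values would get a count of 0.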
A linear search for duplicates in an array is O(n^2) and perfectly fine for small n. Here n is four (a, b, c, d).
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;
The linear search for duplicates will occur on every row, and the count_distinct will be initialized automatically in each row to a missing (.) value. The sum function is used to increment the count when a non-missing value is not found in any prior array indices.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Try transposing the data set so that each ID becomes one column, then run PROC FREQ with the NLEVELS option on those ID columns to count the distinct values in each, and finally merge the counts back onto the original data set.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=input(compress(TableVar,,'kd'), 32.); * numeric, to match ID in HAVE;
drop TableVar;
run;
data want;
merge have count;
by id;
run;
One more way, using CALL SORTN and a few conditions.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
77 . 3 . 4
88 . 9 5 .
99 . . 2 2
76 . . . 2
58 1 1 . .
50 2 . 2 .
66 2 . 7 .
89 1 1 1 .
75 1 2 3 .
76 . 5 6 7
88 . 1 1 1
43 1 . . 1
31 1 . . 2
;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1

SAS retain statement not working as I hoped

I have the following dataset
data have;
input SUBJID VISIT$ PARAMN ABLF$ AVAL;
cards;
1 screen 1 . 151
1 random 1 YES .
1 visit1 1 . .
1 screen 2 . 65.5
1 random 2 YES 65
1 visit1 2 . .
1 screen 3 . .
1 random 3 YES 400
1 visit1 3 . 420
;
run;
I want to create another variable called BASE that captures the value of AVAL (when there is an actual value in place) when ABLF=YES and then carries it across every row with the same PARAMN, until a new PARAMN is encountered.
Basically I want the output to look like this
SUBJID VISIT$ PARAMN ABLF$ AVAL BASE;
1 screen 1 . 151 .
1 random 1 YES . .
1 visit1 1 . . .
1 screen 2 . 65.5 65
1 random 2 YES 65 65
1 visit1 2 . . 65
1 screen 3 . . 400
1 random 3 YES 400 400
1 visit1 3 . 420 400
I used the following code
data want;
set have;
by SUBJID PARAMN;
if first.PARAMN and ABLF=' ' then BASE=.;
if ABLF='YES' then BASE=AVAL;
retain BASE;
run;
However, when I run this, the data does not look exactly as I want above.
RETAIN does not look like the right tool for this. RETAIN can only move data forward in the file. It cannot move it backwards.
Looks like there is just one observation with the "BASE" value. So just merge it back onto the data.
data want;
merge have
have(keep=subjid paramn aval ablf rename=(aval=BASE ablf=xx)
where=(xx='YES'))
;
by SUBJID PARAMN;
drop xx;
run;
Proc SQL:
proc sql;
select a.*,b.aval as BASE from have a left join have(drop=visit where=(ablf='YES')) b
on a.subjid=b.subjid and a.paramn=b.paramn;
quit;
Double do loop:
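data want;
  * first pass over the group: find the baseline value;
  do until(last.paramn);
    set have;
    by subjid paramn notsorted;
    if ablf='YES' then temp=aval;
  end;
  * second pass over the same group: write every row with BASE filled in;
  do until(last.paramn);
    set have;
    by subjid paramn notsorted;
    base=temp;
    output;
  end;
  * TEMP is not retained, so it resets to missing for the next group;
  drop temp;
run;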

Puzzle: How to create a new ID for a union set that has intersection (SAS)

How would I create an ID3 column for any overlapping ID1/ID2 column? (Not just the intersection, but for the union set.)
For example, one ID is an online account number and the other is an IP address that a person used to log on. I would like to specify that as long as either the IP address OR the log-on ID is the same, the rows are assigned the same ID3.
I am using SAS to code.
HAVE
Year ID2 ID1
2010 1 201
2010 1 202
2010 2 203
2011 3 202
2011 4 203
2011 5 204
WANT
Year ID3 ID2 ID1
2010 101 1 201
2010 101 1 202
2010 102 2 203
2011 101 3 202
2011 102 4 203
2011 105 5 204
This is a problem that's generally considered solved, but different problems have different more efficient solutions depending on the likelihood of multiple linkages.
An example (fairly clunky and not particularly efficient, but hopefully explains the general solution) follows. Basically, you need to go through the data and store linkages in a separate structure - arrays or hash tables are most common - and then at the end, output the results of the linkages. You can then merge that back to the main dataset (not provided).
data have;
input Year ID2 ID1 ;
datalines;
2010 1 201
2010 1 202
2010 2 203
2011 3 202
2011 3 204
2011 4 203
2011 5 204
2011 6 203
2011 7 205
;;;;
run;
data want;
set have end=eof;
array new_id[1000] _temporary_;
array new_id1_C[1000] _temporary_;
array new_id2_C[1000] _temporary_;
_i1 = whichn(id1,of new_id1_c[*]);
_i2 = whichn(id2,of new_id2_c[*]);
if not (_i1+_i2) then do;
_eiter+1;
_id3+1;
put _i1= _i2= id1= id2= _eiter= _id3=;
new_id[_eiter]=_id3;
new_id1_c[_eiter] = id1;
new_id2_c[_eiter] = id2;
end;
else do;
if _i1 and _i2 then ;
else do;
_eiter+1;
new_id1_c[_eiter] = id1;
new_id2_c[_eiter] = id2;
new_id[_eiter] = new_id[_i1+_i2]; *only one will be a value;
end;
end;
if eof then do;
do _t = 1 to _eiter;
id1 = new_id1_c[_t];
id2 = new_id2_c[_t];
id3 = new_id[_t];
output;
end;
end;
keep id1 id2 id3;
run;
In this, what I do is for each record, match it up using whichn to the array. If it is an entire-mismatch, it's a new ID; create a new ID. If it's an entire match, move on. If it's a one sided match (id1 is found but id2 is new), make a new row where id2 is added, with id1 and with the id3 that was previously assigned to id1.
The above doesn't actually work perfectly; a later cross on id1 and id2 would cause extra IDs to exist, but it's an example of the basic concept. There are papers easily available on the subject that are too long for an SO answer; for example, Transitive Record Linkage (Glenn Wright, WUSS 2010) is a good example of using hash tables to solve the issue.
Here is an answer which I think will work for all cases, although I don't know how efficient it is:
I'm starting with a data set with a few more observations added to show how the code can handle more difficult cases.
data have;
input year id2 id1;
datalines;
2010 1 201
2010 1 202
2010 2 203
2011 3 202
2011 4 203
2011 5 204
2011 6 205
2011 6 203
2011 7 206
;
run;
First I create a character variable called "links" which has id1 and id2 in it.
data links;
length links $ 20;
set have;
id2t = compress("id2_"||put(id2, 5.));
id1t = compress("id1_"||put(id1, 7.));
links = id2t||" "||id1t;
run;
Then I assign each value of "links" to an element in temporary array "agroup". Next I take each element in agroup which has been assigned a value from "links" and compare it to each subsequent value of agroup. If the values have any id1s or id2s in common, then I concatenate the two together, or, if a value doesn't match any other values, then I put it as the next value of bgroup. Then I go back to the next value of agroup and do the same, repeating the process until I have gone through all the values. At the end there is one value of bgroup for each group, and each value has the id1s and id2s from all the members. Finally I match up these groups with the original data set to get group numbers.
data final;
array agroup[100] $200. _temporary_;
array bgroup[100] $200. _temporary_;
do until (eof);
set links end=eof;
a+1;
agroup[a] = links;
end;
do k = 1 to a;
found = 0;
do i = k+1 to a until (found=1);
do j = 1 to countw(agroup[k]) until (found = 1);
if find(agroup[i], scan(agroup[k], j)) > 0 then do;
found = 1;
agroup[i] = strip(strip(agroup[k])||' '||strip(agroup[i]));
end;
end;
end;
if found = 0 then do;
b+1;
bgroup[b] = agroup[k];
end;
end;
do until (eof2);
set links end=eof2;
do i = 1 to b;
if find(bgroup[i], id2t)+find(bgroup[i], id1t) > 0 then id3 = i;
end;
output;
end;
keep year id2 id1 id3;
run;
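For comparison, the same grouping can be sketched with numeric arrays and a small union-find (disjoint-set) structure instead of string matching. This is an illustrative sketch rather than any answer's method: it assumes the data fits in arrays of 1000 rows, that id1 and id2 are never missing, and the names a1, a2, parent, ri and rj are made up for the example. Every pair of rows that shares an id1 or an id2 is merged, and id3 is the row number of the first row in each group, so the labels are not consecutive.

```sas
data have;
  input year id2 id1;
  datalines;
2010 1 201
2010 1 202
2010 2 203
2011 3 202
2011 4 203
2011 5 204
;

data want;
  array a1[1000] _temporary_;       /* id1 of each row; 1000 is an assumed upper bound */
  array a2[1000] _temporary_;       /* id2 of each row */
  array parent[1000] _temporary_;   /* union-find parent pointer of each row */

  /* pass 1: load every row and start each row as its own group */
  n = 0;
  do until (eof);
    set have end=eof;
    n = n + 1;
    a1[n] = id1;
    a2[n] = id2;
    parent[n] = n;
  end;

  /* merge the groups of any two rows that share id1 or id2 */
  do i = 2 to n;
    do j = 1 to i - 1;
      if a1[i] = a1[j] or a2[i] = a2[j] then do;
        ri = i;
        do while (parent[ri] ne ri); ri = parent[ri]; end;
        rj = j;
        do while (parent[rj] ne rj); rj = parent[rj]; end;
        if ri ne rj then parent[max(ri, rj)] = min(ri, rj);
      end;
    end;
  end;

  /* pass 2: re-read the rows and label each with its group root */
  do i = 1 to n;
    set have;
    id3 = i;
    do while (parent[id3] ne id3); id3 = parent[id3]; end;
    output;
  end;
  stop;
  keep year id2 id1 id3;
run;
```

The pairwise comparison is quadratic in the number of rows, so for large inputs a hash-based approach such as the one in the paper mentioned earlier is likely a better fit.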