Count rows by group and subgroup when some subgroup level has 0 rows - sas

I know how to count group and subgroup frequencies with proc freq or proc sql. My question is: when some level of the subgroup is missing from the data, I still want to show that missing level with a count of 0. How can I do that? For example,
the data set is:
group1 group2
1 A
1 A
1 A
1 A
2 A
2 B
2 B
I want a result as:
group1 group2 N
1 A 4
1 B 0
2 A 1
2 B 2
If I only use the default SAS setting, it will usually show as
group1 group2 N
1 A 4
2 A 1
2 B 2
But I still want the second line in the result to tell me that there are 0 observations in that category.

Use the SPARSE option within proc freq. Consider it a cross join between all levels of GROUP1 and GROUP2.
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc freq data=have;
table group1*group2/out=want sparse;
run;
proc print data=want;
run;
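The OUT= dataset from PROC FREQ holds the zero cells too, in a variable named COUNT (plus PERCENT). If you want the column to be called N as in the question, a small follow-up step can rename it (just a sketch):
data want;
set want(keep=group1 group2 count rename=(count=N));
run;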

Reeza's sparse option works as long as each group is represented in your data at least once. Suppose there were a group1 value of 3 that is not represented in your data, and you still wanted it to show up in the frequency table. If that is the case, the solution is to create a reference table with all of your categories and then right join your data to it.
Create a reference table:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
Create the frequency table with proc sql, right joining to the reference table:
proc sql;
select
r.group1,
r.group2,
count(h.group1) as freq
from
have h
right join ref r
on h.group1 = r.group1
and h.group2 = r.group2
group by
r.group1,
r.group2
order by
r.group1,
r.group2
;
quit;
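If you would rather generate the reference table than type it out, the cross-join idea also works in PROC SQL. This is only a sketch: it assumes a small helper dataset g1_levels (not in the original answer) listing every group1 value you want, including the unobserved 3, and takes the group2 levels from the data.
data g1_levels;
do group1 = 1 to 3;
output;
end;
run;
proc sql;
create table ref as
select g1.group1, g2.group2
from g1_levels as g1, (select distinct group2 from have) as g2
order by group1, group2;
quit;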

Another option, a cross between DWal's concern of "what if the category isn't in the data at all" and Reeza's One Proc, One Solution, is proc tabulate. With preloadfmt and printmiss, it works as long as the format contains all possible values, even values that never appear in the data.
proc format;
value groupformat
1='Group 1'
2='Group 2'
3='Group 3'
;
quit;
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc tabulate data=have;
class group1 group2/preloadfmt;
format group1 groupformat.;
tables group1*group2,n/printmiss misstext='0';
run;
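If some group2 level never appeared in the data at all, the same preloadfmt trick would apply to group2 as well. A sketch under that assumption, using a hypothetical character format $grp2f that lists the levels you want:
proc format;
value $grp2f
'A'='A'
'B'='B'
;
run;
proc tabulate data=have;
class group1 group2/preloadfmt;
format group1 groupformat. group2 $grp2f.;
tables group1*group2,n/printmiss misstext='0';
run;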

How to do this via proc summary, using DWal's reference table to specify which combinations of values to use:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
data have;
input group1 group2 $1.;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc summary nway data = have classdata=ref;
class group1 group2;
output out = summary (drop = _TYPE_);
run;
N.B. I had to tweak the have dataset slightly to make sure that group2 has length 1 in both datasets. If you use variables with the same name but different lengths in your classdata= and data= datasets, SAS will complain.
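A minimal sketch of that tweak, using an explicit LENGTH statement instead of the $1. informat so that group2 gets the same length as in ref:
data have;
length group2 $ 1; /* match the length of group2 in the classdata= dataset */
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;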

Related

SAS - Row-by-row comparison within different ID groups of the same dataset, deleting ALL duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if an observation's Value appears in two or more ID groups, I'd like to delete that observation entirely; only values that are unique to a single ID group should remain.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively and tried a few things. I thought I could mark the duplicates with a flag and then delete based on that flag.
The flagging code is:
data have;
set want;
by ID;
flag = first.ID ne last.ID;
run;
This worked for some cases, but duplicates within the same value group also got flagged.
Therefore the first observation I wanted to keep (1 D) got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
by ID value;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Note that I changed want and have, assuming you create what you want from what you have, and I assume have is sorted by ID and value.
Also, this will only flag the duplicate 1 D above, not 3 Z.
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels), then you just need to keep track of whether any ID is ever different from the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
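Note that the BY statement here assumes have is sorted (or at least grouped) by value; if it is not, sort it first, for example:
proc sort data=have;
by value;
run;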

SAS: transpose columns to rows and values to columns

I have a summary table which I want to transpose, but I can't get my head around it. The current columns should become the rows, and the group values should become the columns.
Some explanation about the table. Each column represents a year. People can be in 3 groups: A, B or C. In 2016, everyone (100) is in group A. In 2017, 35 are in group A (5 + 20 + 10), 15 in B and 50 in C.
DATA have;
INPUT year2016 $ year2017 $ year2018 $ count;
DATALINES;
A A A 5
A A B 20
A A C 10
A B C 15
A C A 50
;
RUN;
I want to be able to make a nice graph of the evolution of the groups through the different periods. So I want to end up with a table where the columns are the rows (=period) and the columns are the values (= the 3 different groups). Please find an example of the table I want:
Image of table want
I have tried different approaches, but I can't get what I want.
There may be a more direct way, but this is probably how I would do it.
DATA have;
INPUT year2016 $ year2017 $ year2018 $ count;
id + 1;
DATALINES;
A A A 5
A A B 20
A A C 10
A B C 15
A C A 50
;
RUN;
proc print;
proc transpose data=have out=want1 name=period;
by id count notsorted;
var year:;
run;
proc print;
run;
proc summary data=want1 nway completetypes;
class period col1;
freq count;
output out=want2(drop=_type_);
run;
proc print;
run;
proc transpose data=want2 out=want(drop=_name_) prefix=Group_;
by period;
var _freq_;
id col1;
run;
proc print;
run;

SAS 9.4 Replacing all values after current line based on current values

I am matching files based on ID numbers. I need to format a data set with the IDs to be matched, so that the same ID number is not repeated in column a (because column b's ID is the surviving ID after the match is completed). My list of IDs has over 1 million observations, and the same ID may be repeated multiple times in either/both columns.
Here is an example of what I've got/need:
Sample Data
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
The surviving IDs would be:
2
4
5
error - 1 no longer exists
error - 1 no longer exists
8
WHAT I NEED
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I am, probably very obviously, a SAS novice, but here is what I have tried, re-running over and over again because I have some IDs that are repeated upward of 50 times or more.
Proc sort data=Have;
by ID1;
run;
This sort makes the repeated ID1 values consecutive, so that I could use LAG to replace the destroyed ID1s with the surviving ID2 from the line above.
Data Want;
set Have;
by ID1;
lagID1=LAG(ID1);
lagID2=LAG(ID2);
If NOT first.ID1 THEN DO;
If ID1=lagID1 THEN ID1=lagID2;
KEEP ID1 ID2;
IF ID1=ID2 then delete;
end;
run;
That sort of works, but I still end up with some duplicates that won't resolve no matter how many times I re-run it (I would have looped it, but I don't know how), because they just switch back and forth between IDs that have other duplicates (I can get down to about 2,000 of these).
I have figured out that instead of using LAG, I need to replace all values after the current line with ID2 for each ID1 value, but I cannot figure out how to do that.
I want to read observation 1, find all later instances of the value of ID1, in both ID1 or ID2 columns, and replace that value with the current observation's ID2 value. Then I want to repeat that process with line 2 and so on.
For the example, I would want to look for any instances after line one of the value 1, and replace it with 2, since that is the surviving ID of that pair - 1 may appear further down multiple times in either of the columns, and I need all of them to be replaced. Line two would look for later values of 3 and replace them with 4, and so on. The end result should be that an ID number only appears once ever in the ID1 column (though it may appear multiple times in the ID2 column).
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
After first line has been read, data set would look as follows:
ID1 ID2
1 2
3 4
2 5
6 2
2 7
5 8
Reading observation two would make no changes since 3 does not appear again; after observation 3, the set would be:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
5 8
Again, there would be no changes from observation four, but observation 5 would cause the final change:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I have tried using the following statement, but I can't even tell if I am on the completely wrong track or if I just can't get the syntax figured out.
Data want;
Set have;
Do i=_n_;
ID=ID2;
Replace next var{EUID} where (EUID1=EUID1 AND EUID2=EUID1);
End;
Run;
Thanks for your help!
There is no need to work back and forth through the data file. You just need to retain the replacement information so that you can process the file in a single pass.
One way to do that is to make a temporary array using the values of the ID variables as the index. That is easy to do for your simple example with small ID values.
So for example if all of the ID values are integers between 1 and 1000 then this step will do the job.
data want ;
set have ;
array xx (1000) _temporary_;
do while (not missing(xx(id1))); id1=xx(id1); end;
do while (not missing(xx(id2))); id2=xx(id2); end;
output;
xx(id1)=id2;
run;
You probably need to add a test to prevent cycles (1 -> 2 -> 1).
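As a rough sketch of such a test (not part of the original answer), you can cap the number of lookups with an iterative DO so that a cycle cannot hang the step; the limit of 1000 hops below is arbitrary:
data want ;
set have ;
array xx (1000) _temporary_;
/* follow the replacement chain, but give up after 1000 hops in case of a cycle */
do _hops = 1 to 1000 while (not missing(xx(id1))); id1=xx(id1); end;
do _hops = 1 to 1000 while (not missing(xx(id2))); id2=xx(id2); end;
output;
xx(id1)=id2;
drop _hops;
run;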
For a more general solution you should replace the array with a hash object instead. So something like this:
data want ;
if _n_=1 then do;
declare hash h();
h.definekey('old');
h.definedata('new');
h.definedone();
call missing(new,old);
end;
set have ;
do while (not h.find(key:id1)); id1=new; end;
do while (not h.find(key:id2)); id2=new; end;
output;
h.add(key: id1,data: id2);
drop old new;
run;
Here's an implementation of the algorithm you've suggested, using a modify statement to load and rewrite each row one at a time. It works with your trivial example but with messier data you might get duplicate values in ID1.
data have;
input ID1 ID2 ;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
title "Before making replacements";
proc print data = have;
run;
/*Optional - should improve performance at cost of increased memory usage*/
sasfile have load;
data have;
do i = 1 to nobs;
do j = i to nobs;
modify have point = j nobs = nobs;
/* Make copies of target and replacement value for this pass */
if j = i then do;
id1_ = id1;
id2_ = id2;
end;
else do;
flag = 0; /* Keep track of whether we made a change */
if id1 = id1_ then do;
id1 = id2_;
flag = 1;
end;
if id2 = id1_ then do;
id2 = id2_;
flag = 1;
end;
if flag then replace; /* Only rewrite the row if we made a change */
end;
end;
end;
stop;
run;
sasfile have close;
title "After making replacements";
proc print data = have;
run;
Please bear in mind that as this modifies the dataset in place, interrupting the data step while it is running could result in data loss. Make sure you have a backup first in case you need to roll your changes back.
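For example, a simple way to take that backup (just a sketch; name it however you like) is to copy the dataset before running the in-place update:
data have_backup;
set have;
run;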
Seems like this should do the trick and is fairly straightforward. Let me know if it is what you are looking for:
data have;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test();
proc sql noprint;
select count(*) into: cnt
from have;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have
where monotonic() = &i;quit;
data have;
set have;
if (_n_ > input("&i",8.))then do;
if (id1 = input("&id1",8.))then id1 = input("&id2",8.);
if (id2 = input("&id1",8.))then id2 = input("&id2",8.);
end;
run;
%end;
%mend test;
%test();
this might be a little faster:
data have2;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test2();
proc sql noprint;
select count(*) into: cnt
from have2;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have2
where monotonic() = &i;
update have2 set id1 = &id2
where monotonic() > &i
and id1 = &id1;
quit;
proc sql noprint;
update have2 set id2 = &id2
where monotonic() > &i
and id2 = &id1;
quit;
%end;
%mend test2;
%test2();

Create new rows in a data set based on existing ones - SAS

I have a dataset looking something like this:
var1 var2 count
cat1 no 1
cat1 yes 4
cat1 unkown 3
cat2 no 7
cat2 yes 3
cat2 unkown 5
cat3 no 2
cat3 yes 9
cat3 unkown 0
What I want to do is combine var1 & var2 into a new variable where the first row comes from var1 and the others from var2. So it's supposed to look like:
comb count
cat1
no 1
yes 4
unkown 3
cat2
no 7
yes 3
unkown 5
cat3
no 2
yes 9
unkown 0
Any help would be highly appreciated!
It's quite simple.
Here is the solution:
1) Create the source dataset:
data testa;
infile datalines dsd dlm=',';
input var1 : $200. var2 : $200. count : 8. ;
datalines;
cat1,no,1,
cat1,yes,4,
cat1,unkown,3,
cat2,no,7,
cat2,yes,3,
cat2,unkown,5,
cat3,no,2,
cat3,yes,9,
cat3,unkown,0,
;
run;
2) Select the var list: cat1|cat2|cat3
proc sql;
select distinct(var1) into: list_var separated by '|' from testa;
quit;
3) Process the var list one by one
%macro processListVar(list_var);
data want;
run;
%let k=1;
%do %while (%qscan(&list_var, &k,|) ne );
%let var = %scan(&list_var, &k,|);
data testb(drop=var1 rename=(var2=comb));
set testa;
N=_N_+1+&k;
where var1="&var";
run;
data testc;
N=1+&k;
comb="&var";
count=.;
run;
data tmp;
set testb testc;
run;
proc sort data=tmp out=teste;
by N;
run;
data want;
set want teste;
run;
%put var=&var;
%let k = %eval(&k + 1);
%end;
%mend processListVar;
%processListVar(&list_var);
4) At the end you get the result in dataset want.
Finally, you have to exclude the N column, like this:
data want_cleaned (drop=N);
set want;
run;
5) More explanation on the code.
a. The key problem is keeping the order between cat1, cat2 and cat3.
b. So I split the problem into one subdataset per category (cat1, cat2, ...) and created a %do %while loop to iterate through the categories.
c. The column N numbers the lines (like an index), so we can sort on it to keep the order.
d. For example, for the first var, cat1: we select the column var2, rename it to comb, and drop the var1 column. This creates the testb dataset.
The testb dataset carries the index (column N), and testc creates the first line of the subdataset (the category header, with N=1+&k). &k is used across all subdatasets, so the index keeps increasing from one subdataset to the next without the groups interfering with each other. testb and testc are then combined; the tmp dataset contains everything needed for cat1. All the subdatasets are then appended into dataset want.
To summarize: we loop over the categories, append the datasets together at the end, and sort on column N to display the lines in the order you wanted.
Regards,

Update one dataset with another without using PROC SQL

I have the below two datasets
Dataset A
id age mark
1 . .
2 . .
1 . .
Dataset B
id age mark
2 20 200
1 10 100
I need the below dataset as output
Output Dataset
id age mark
1 10 100
2 20 200
1 10 100
How can I carry this out without using PROC SQL, i.e. using a DATA step?
There are many ways to do this. The easiest is to sort the two data sets and then use MERGE. For example:
proc sort data=A;
by id;
run;
proc sort data=B;
by id;
run;
data WANT;
merge A(drop=age mark) B;
by ID;
run;
The trick is to drop the variables you are adding from the first data set A; the new variables will come from the second data set B.
Of course, this solution does not preserve the original order of the observations in your data set AND only works because your second data set contains unique values of id.
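If the original order does matter, one workaround (a sketch, using a helper variable _order that is not part of the original answer) is to number the rows before sorting and re-sort at the end:
data A2;
set A;
_order = _n_; /* remember each row's original position */
run;
proc sort data=A2; by id; run;
proc sort data=B; by id; run;
data WANT;
merge A2(drop=age mark) B;
by id;
run;
proc sort data=WANT; by _order; run; /* restore the original order */
data WANT;
set WANT(drop=_order);
run;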
I tried this and it worked for me, even when there is data in that column you would like to preserve. Just for completeness' sake I added an SQL variant too.
data a;
input id a;
datalines;
1 10
2 20
;
data b;
input id a;
datalines;
1 .
1 5
1 .
2 .
3 4
;
data c (drop=b);
merge a (rename = (a=b) in=ina) b (in = inb);
by id;
if b ne . then a = b;
run;
proc sql;
create table d as
select a.id, a.a from a right join b on a.id=b.id where a.id is not null
union all
select b.id, b.a from a right join b on a.id = b.id where a.id is null
;
quit;