Create new row to data set based existing ones SAS - sas

I have a dataset looking something like this:
var1 var2 count
cat1 no 1
cat1 yes 4
cat1 unkown 3
cat2 no 7
cat2 yes 3
cat2 unkown 5
cat3 no 2
cat3 yes 9
cat3 unkown 0
What I want to do is combine var1 & var2 into new variable where first row is from var1 and the others from var2. So it supposed to look like:
comb count
cat1
no 1
yes 4
unkown 3
cat2
no 7
yes 3
unkown 5
cat3
no 2
yes 9
unkown 0
Any help would be highly appreciated!

It's quite simple.
Here the solution :
1) create the dataset source:
data testa;
infile datalines dsd dlm=',';
input var1 : $200. var2 : $200. count : 8. ;
datalines;
cat1,no,1,
cat1,yes,4,
cat1,unkown,3,
cat2,no,7,
cat2,yes,3,
cat2,unkown,5,
cat3,no,2,
cat3,yes,9,
cat3,unkown,0,
;
run;
2) Selection of var list : cat1|cat2|cat3
proc sql;
select distinct(var1) into: list_var separated by '|' from testa;
run;
3) Process the var list one by one
%macro processListVar(list_var);
data want;
run;
%let k=1;
%do %while (%qscan(&list_var, &k,|) ne );
%let var = %scan(&list_var, &k,|);
data testb(drop=var1 rename=(var2=comb));
set testa;
N=_N_+1+&k;
where var1="&var";
run;
data testc;
N=1+&k;
comb="&var";
count=.;
run;
data tmp;
set testb testc;
run;
proc sort data=tmp out=teste;
by N;
run;
data want;
set want teste;
run;
%put var=&var;
%let k = %eval(&k + 1);
%end;
%mend processListVar;
%processListVar(&list_var);
4) At the end you get the result in dataset want.
You have to exclude finaly the N column like that :
data want_cleaned (drop=N);
set want;
run;
5) More explanation on the code.
a. The key problem was to keep the order between cat1,cat2,cat3.
b. So I divided the problem by each dataset cat1, cat2, .. and created a %do %while to loop through categories.
c. We use the column N, to count the number of line (like an index), and then we can sort on this column, to keep the order.
d. For example : the first var cat1 : We select the column var2, we rename it like the comb column. We drop the var1 column. It create the testb dataset.
The testb dataset is used to create an index (column N) and we create the first line of our subdataset (N=1+&k) in testc. &k is used through all subdatasets. Like that the index is continuing between subdatasets. (without interfering each others). We make a merge between testb and testc. The dataset tmp contains all info needed for cat1. Then we merge all subdatasets in dataset want.
So to summary, we create a loop, and we merge the datasets together at the end. We make a sort on the column N, to display lines in the order you wanted.
Regards,

Related

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need keep track of whether any ID is ever different than the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;

SAS: Creating dummy variables from categorical variable

I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1'
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First you sort the data, then add in a count to create the 1 variable. Next you use PROC TRANSPOSE to flip the data from long to wide. Then fill in the missing values with a 0. This is a fully dynamic method, it doesn't matter how many different Injury types you have or how many records per person. There are other methods that are probably shorter code, but I think this is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $24.;
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!

SAS: Using Do/Loop in a Proc Transpose

I'm not very familiar with Do Loops in SAS and was hoping to get some help. I have data that looks like this:
Product A: 1
Product A: 2
Product A: 4
I'd like to transpose (easy) and flag that Product A: 3 is missing, but I need to do this iteratively to the i-th degree since the number of products is large.
If I run the transpose part in SAS, my first column will be 1, second column will be 2, and third column will be 4 - but I'd really like the third column to be missing and the fourth column to be 4.
Any thoughts? Thanks.
Get some sample data:
proc sort data=sashelp.iris out=sorted;
by species;
run;
Determine the largest column we will need to transpose to. Depending on your situation you may just want to hardcode this value using a %let max=somevalue; statement:
proc sql noprint;
select cats(max(sepallength)) into :max from sorted;
quit;
%put &=max;
Transpose the data using a data step:
data want;
set sorted;
by species;
retain _1-_&max;
array a[1:&max] _1-_&max;
if first.species then do;
do cnt = lbound(a) to hbound(a);
a[cnt] = .;
end;
end;
a[sepallength] = sepallength;
if last.species then do;
output;
end;
keep species _1-_&max;
run;
Notice we are defining an array of columns: _1,_2,_3,..._max. This happens in our array statement.
We then use by-group processing to populate these newly created columns for a single species at a time. For each species, on the first record, we clear the array. For each record of the species, we populate the appropriate element of the array. On the final record for the species output the array contents.
You need a way to tell SAS that you have 4 products and the values are 1-4. In this example I create dummy ID with the needed information then transpose using ID statement to name new variables using the value of product.
data product;
input id product ##;
cards;
1 1 1 2 1 4
2 2 2 3
;;;;
run;
proc print;
run;
data productspace;
if 0 then set product;
do product = 1 to 4;
output;
end;
stop;
run;
data productV / view=productV;
set productspace product;
run;
proc transpose data=productV out=wide(where=(not missing(id))) prefix=P;
by id;
var product;
id product;
run;
proc print;
run;

Count rows number by group and subgroup when some subgroup factor is 0

I know how to count group and subgroup numbers through proc freq or sql. My question is when some factor in the subgroup is missing, and I still want to show missing factor as 0. How can I do that? For example,
the data set is:
group1 group2
1 A
1 A
1 A
1 A
2 A
2 B
2 B
I want a result as:
group1 group2 N
1 A 4
1 B 0
2 A 1
2 B 2
If I only use the default SAS setting, it will usually show as
group1 group2 N
1 A 4
2 A 1
2 B 2
But I still want to the second line in the result tell to me that there are 0 observations in that category.
Use the SPARSE option within proc freq. Consider it a cross join between all options from GROUP1 and GROUP2.
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc freq data=have;
table group1*group2/out=want sparse;
run;
proc print data=want;
run;
Reeza's sparse option works as long as each group is represented in your data at least once. Suppose there were a group1 3 that is not represented in your data, and you would still want them to show up in the frequency table. If that is the case, the solution is to create a reference table with all of your categories then right join your frequency table to it.
Create a reference table:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
Create the frequency table with proc sql, right joining to the reference table:
proc sql;
select
r.group1,
r.group2,
count(h.group1) as freq
from
have h
right join ref r
on h.group1 = r.group1
and h.group2 = r.group2
group by
r.group1,
r.group2
order by
r.group1,
r.group2
;
quit;
Another option that's a cross between DWal's issue of "what if the data isn't in the data" and Reeza's One Proc, One Solution, is proc tabulate. If the format contains all possible values, even if the values don't appear, it works, with printmiss.
proc format;
value groupformat
1='Group 1'
2='Group 2'
3='Group 3'
;
quit;
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc tabulate data=have;
class group1 group2/preloadfmt;
format group1 groupformat.;
tables group1*group2,n/printmiss misstext='0';
run;
How to do this via proc summary, using DWal's reference table to specify which combinations of values to use:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
data have;
input group1 group2 $1.;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc summary nway data = have classdata=ref;
class group1 group2;
output out = summary (drop = _TYPE_);
run;
N.B. I had to tweak the have dataset slightly to make sure that group2 has length 1 in both datasets. If you use variables with the same name but different lengths in your classdata= and data= datasets, SAS will complain.

How to change table structure in SAS?

I have a dataset that has columns like:
a|b|c|d|e
and rows like:
1|3|5|7|9
2|4|6|8|10
How to change it to:
Char|Num|
a|1
a|2
b|3
b|4
c|5
c|6
d|7
d|8
e|9
e|10
Thank you in advance!
You can use PROC TRANSPOSE. The only gotcha is to get what you want you need a BY variable. Easiest thing is to add a record number and use that as your BY.
data have;
input a b c d;
i = _n_;
datalines;
1 2 3 4
5 6 7 8
;
run;
proc transpose data=have out=want(drop=i);
by i;
var a b c d;
run;