Concatenate duplicate values - sas

I have a table with some variables, say var1 and var2, and an identifier. For some reason, some identifiers have 2 observations.
I would like to know if there is a simple way to fold the second observation of an identifier into the first one. That is,
instead of having two observations, each with var1 and var2, for the same identifier value
ID var1 var2
------------------
A1 12 13
A1 43 53
having just one, but with something like var1 var2 var1_2 var2_2.
ID var1 var2 var1_2 var2_2
--------------------------------------
A1 12 13 43 53
I can probably do that with renaming all my variables, then merging the table with the renamed one and dropping duplicates, but I assume there must be a simpler version.

Actually, your suggestion of merging the values back is probably the best.
This works if you have at most one duplicate for any given ID and the data is already sorted by ID (the BY statement requires it).
data first dups;
set have;
by id;
if first.id then output first;
else output dups;
run;
proc sql noprint;
create table want as
select a.id,
a.var1,
a.var2,
b.var1 as var1_2,
b.var2 as var2_2
from first as a
left join
dups as b
on a.id=b.id;
quit;
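To check whether the at-most-one-duplicate assumption holds before using this approach, a quick count per ID can help. This is a minimal sketch, assuming the input dataset is named have as above; the column names n_obs and max_obs_per_id are illustrative only.

```sas
/* Largest number of observations sharing one id. A value above 2 means
   the first/dups split would send more than one row per id to dups. */
proc sql;
  select max(n_obs) as max_obs_per_id
  from (select id, count(*) as n_obs
        from have
        group by id);
quit;
```

If the result is greater than 2, the left join returns several rows for the affected IDs instead of one.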

Another method makes use of PROC TRANSPOSE and a data-step merge:
/* You can experiment by adding more data to this datalines step */
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
;
run;
/* This step puts the var1 values onto one line */
proc transpose data=tab out=new1 (drop=_NAME_) prefix=var1_;
by id;
var var1;
run;
/* This does the same for the var2 values */
proc transpose data=tab out=new2 (drop=_NAME_) prefix=var2_;
by id;
var var2;
run;
/* The two transposed datasets are then merged together to give one line */
data want;
merge new1 new2;
by id;
run;
As an example:
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
A2 199 342
A2 1132 111
A2 91913 199191
B1 1212 43214
;
run;
Gives:
ID var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
---------------------------------------------------
A1 12 43 . 13 53 .
A2 199 1132 91913 342 111 199191
B1 1212 . . 43214 . .

There's a very simple way of doing this, using the IDGROUP option on the OUTPUT statement of PROC SUMMARY.
data have;
input ID $ var1 $ var2 $;
datalines;
A1 12 13
A1 43 53
;
run;
proc summary data=have nway;
class id;
output out=want (drop=_:)
idgroup(out[2] (var1 var2)=);
run;


Add new empty rows to a SAS table with names from another table

Assume I have table foo which contains a (dynamic) list of new rows which I want to add to another table have, so that it yields a table want looking e.g. like this:
x y p_14 p_15
1 2 2 99
2 4 7 24
Example data for foo:
id row_name
14 p_14
15 p_15
Example data for have:
x y p Z
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
What I have so far is the following which is not yet in macro shape:
proc sql;
create table want as
select old.*, t1.p_14, t2.p_15 /* choosing non-duplicate rows */
from (select x, y from have) old
left join (select x, y, z as p_14 from have where p=14) t1
on old.x=t1.x and old.y=t1.y
left join (select x, y, z as p_15 from have where p=15) t2
on old.x=t2.x and old.y=t2.y
;
quit;
Ideally, I am aiming for a macro which takes foo as input and automatically creates all the joins from above. Also, the solution should not produce any warnings in the log. My challenge is how to dynamically choose the correct (non-duplicate) rows.
PS: This is a follow-up question of Populate SAS macro-variable using a SQL statement within another SQL statement? The important bit is that it is not a full transpose, I guess.
You can go from HAVE to WANT with PROC TRANSPOSE.
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
by x y ;
id p ;
var z;
run;
To limit it to the values of P that occur in FOO you could use a macro variable (as long as the number of observations in FOO is small enough).
proc sql noprint ;
select id into :idlist separated by ' ' from foo ;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
If the issue is you want variable P_17 to be in the result even if 17 does not appear in HAVE then add a little more complexity. For example add another data step that will force the creation of the empty variables. You can generate the list of variable names from the list of id's in FOO.
proc sql noprint ;
select id , cats('p_',id)
into :idlist separated by ' '
, :varlist separated by ' '
from foo
;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
data want ;
set want (keep=x y);
array all &varlist ;
set want ;
run;
Results:
Obs x y p_14 p_15 p_17
1 1 2 2 99 .
2 2 4 7 24 .
If the number of values is too large to store in a single macro variable (limit 64K bytes) you could generate the WHERE statement with a data step to a file and use %INCLUDE to add the WHERE statement into the code.
filename where temp;
data _null_;
set foo end=eof;
file where ;
if _n_=1 then put 'where p in (' @;
put id @ ;
if eof then put ');' ;
run;
proc transpose ... ;
%include where / source2;
...
Use a macro program:
data have;
input x y p Z;
cards;
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
;
data foo;
input id row_name $;
cards;
14 p_14
15 p_15
;
%macro test(dsn);
proc sql;
select count(*) into:n trimmed from &dsn;
select id into: value separated by ' ' from &dsn;
create table want as
select distinct a.x,a.y,
%do i=1 %to &n;
%let cur=%scan(&value,&i);
t&i..p_&cur
%if &i<&n %then ,;
%else ;
%end;
from have a
%do i=1 %to &n;
%let cur=%scan(&value,&i);
left join have (where=(p=&cur) rename=(z=p_&cur.)) t&i.
on a.x=t&i..x and a.y=t&i..y
%end;
;
quit;
%mend;
%test(foo);

SAS replace character in ALL columns

I have a SAS dataset that I have to export to a .csv-file. I have the following two contradicting requirements.
I have to use the semicolon as the delimiter in the .csv-file.
Some of the character variables are manually inputted strings from formulas, hence they may contain semicolons.
My solution to the above is to either escape the semicolon or to replace it with a comma.
How can I, in a nice, clean and efficient way use e.g. tranwrd on an entire dataset?
My attempt:
For each variable, use the tranwrd(.., ";", ",") function on a variable in the data set. Update the dataset and loop through all variables. This, however, is naturally a very inefficient way of doing it for even semi-large datasets, since I have to do a datastep for each variable. The code for it is quite ugly, since I have to get the variable names by a few steps, but the inefficiency definitely takes the cake.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
* Get the variable names;
proc contents data=test out=vars(keep=name type varnum) order=varnum noprint;
run;
* Sort by variable number;
proc sort data=vars;
by varnum;
run;
* Put variable names into a space-separated string;
proc sql noprint;
select compress(name)
into :name_list separated by ' '
from vars;
quit;
%let len = %sysfunc(countw(&name_list));
*Initialize loop dataset;
data a;
set test;
run;
%macro loop;
%do i = 1 %to &len;
%let j = %scan(&name_list,&i);
data a(rename=(v_&j = &j) drop=&j);
set a;
v_&j.=compress(tranwrd(&j,";",","));
run;
%end;
%mend;
%loop;
I think I may have a more elegant solution to your problem:
data class;
set sashelp.class;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","X"));
end;
drop i;
run;
You can use an array to reference all character columns in your data set and then loop through them.
The most widely used standard for csv files whose fields can contain delimiters is to quote fields that contain them, and double up any quotes. In SAS you can do this automatically using the dlm and dsd options in a put statement:
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb" 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
data _null_;
set test;
file "c:\temp\test.csv" dsd dlm=';';
put (_ALL_) (&);
run;
This results in the following semicolon-delimited csv (minus a header row, but that's a separate issue):
"Aaa;;";50;11;1;222;"a;s"
"Bbb""";35;12;2;250;qw
Comma,;75;13;3;foo;zx
Sorry, didn't notice your comment about the workaround until after I posted this. I'll leave it here in case anyone finds it helpful.
Fields in a properly formatted delimited file are quoted. PROC EXPORT will do that. There is no need to change the data.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
filename FT45F001 temp;
proc export data=test outfile=FT45F001 dbms=csv;
delimiter=';';
run;
data _null_;
infile FT45F001;
input;
list;
run;
proc import replace datafile=FT45F001 dbms=csv out=test2;
delimiter=';';
run;
proc print;
run;
proc compare base=test compare=test2;
run;

SAS Delete group if there are missing observations

In my SAS data set there are groups, identified by id, and I want to delete every group that has missing values in a certain variable.
For example I have this sas data set:
data have;
input v1 v2 v3 id;
datalines;
9 7 210 1
0 6 . 1
9 3 320 2
6 1 . 1
9 4 432 2
;
run;
I tried this:
/*Order by id*/
proc sort data=have;
by id;
run;
/*Select no missing observations by id*/
data want;
set have;
if cmiss(of _all_) then delete;
run;
However, this code does not exclude the ids that have missing values. It only deletes the rows with missing values.
Hmmm. You can use proc sql for this:
proc sql;
delete from have
where exists (select 1 from have have2 where have.id = have2.id and (have2.v1 is null or have2.v2 is null or have2.v3 is null));
quit;
One idea might be to use a double DOW loop. First to check for any missing values and then a second one to output the records for the ids with no missing values.
data have;
input v1 v2 v3 id;
datalines;
9 7 210 1
0 6 . 1
9 3 320 2
6 1 . 1
9 4 432 2
1 2 333 3
;
You will need to sort as in your example.
data want ;
do until (last.id);
set have;
by id;
anymissing=max(anymissing,cmiss(of v1-v3));
end;
do until (last.id);
set have;
by id;
if not anymissing then output;
end;
run;
If you just don't want rows with missing values in your result dataset, there is no need to delete anything. Simply exclude them when writing the result dataset, or overwrite the source dataset:
data have;/*overwriting my have dataset instead of deleting lines*/
set have;
if not cmiss(of _ALL_);
run;
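If you want to remove all rows of a group when any row in it has a missing value, you can store the ID once a missing value is seen and then skip every row with that ID. This only works if a row with a missing value comes first within its ID group, and sorting by id alone does not guarantee that. For a single checked variable, sorting by id and that variable (for example by id v3) does, because missing values sort first. Note that cmiss is applied to v1-v3 rather than _ALL_, since _ALL_ would include the retained variable x, which is missing on the first row:
data want;
retain x;
set have;
if cmiss(of v1-v3) then
x= id;
if x ne id;
run;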
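An order-independent variant of the same idea can be sketched with a subquery, assuming the same have dataset with v1-v3 and id: collect the ids that have at least one missing value and keep only the rows whose id is not among them. The output name want_sql is a hypothetical name chosen here so that it does not overwrite want.

```sas
/* Keep only rows whose id never appears with a missing v1, v2 or v3. */
proc sql;
  create table want_sql as
  select *
  from have
  where id not in (
    select id
    from have
    where cmiss(v1, v2, v3) > 0   /* rows with at least one missing value */
  );
quit;
```

Because the subquery scans the whole table, the position of the incomplete row within its group does not matter and no prior sort is needed.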

Count over columns in SAS

I have a data set in SAS containing individuals as rows and a variable for each period as columns. It looks something like this:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
What I want is for SAS to count how many times each number occurs in each time period. So I want to get something like this:
data want;
input count t1 t2 t3;
cards;
111 1 3 1
112 3 1 0
123 0 0 3
;
run;
I could do this with proc freq, but outputting this doesn't work very well, when I have a lot of columns.
Thanks
In general, storing data in the metadata is a bad idea. Here, PERIOD is encoded in the Tn variable names when it should really be a grouping variable. Having said that, you can still have your cake and eat it too.
PROC SUMMARY can get the counts for each Tn quickly and then you will have smaller data set to fiddle with. Here is one approach that should work well for many time periods.
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;;;;
run;
proc print;
run;
proc summary data=have chartype;
class t:;
ways 1;
output out=want;
run;
proc print;
run;
data want;
set want;
p = findc(_type_,'1');
c = coalesce(of t1-t3);
run;
proc print;
run;
proc summary data=want nway completetypes;
class c p;
freq _freq_;
output out=final;
run;
proc print;
run;
proc transpose data=final out=morefinal(drop=_name_) prefix=t;
by c;
id p;
var _freq_;
run;
proc print;
run;
First restructure the data so that it is in more of a vertical fashion. This will be easier to work with. We also want to create a flag that we will use as a counter later on.
data have2;
set have;
array arr[*] t1-t3;
flag = 1;
do period=lbound(arr) to hbound(arr);
val = arr[period];
output;
end;
keep period val flag;
run;
Summarize the data so we have the number of times that value occurred in each of the periods.
proc sql noprint;
create table smry as
select val,
period,
sum(flag) as count
from have2
group by 1,2
order by 1,2
;
quit;
Transpose the data so we have one line per value and then the counts for each period after that:
proc transpose data=smry out=want(drop=_name_);
by val;
id period;
var count;
run;
Note that when you define the array in the first step you could use this notation which would allow for a dynamic number of periods:
array arr[*] t:;
This assumes every variable beginning with 't' in the dataset should go into the array.
If your computer memory is large enough to hold the entire output, then Hash could be a viable solution:
data have;
input individual t1 t2 t3;
cards;
1 112 111 123
2 112 111 123
3 111 111 123
4 112 112 111
;
run;
data _null_;
if _n_=1 then
do;
/*This is to construct a Hash, where count is tracked and t1-t3 is maintained*/
declare hash h(ordered:'a');
h.definekey('count');
h.definedata('count', 't1','t2','t3');
h.definedone();
call missing(count, t1,t2,t3);
end;
set have(rename=(t1-t3=_t1-_t3))
/*rename to avoid conflict between input data and Hash object*/
end=last;
array _t(*) _t:;
array t(*) t:;
/*The key is to set up two arrays, one is for input data,
another is for Hash feed, and maneuver their index variable accordingly*/
do i=1 to dim(_t);
count=_t(i);
rc=h.find(); /*search the Hash and bring back data elements if found*/
/*If there is a match, then corresponding 't' will increase by '1'*/
if rc=0 then
t(i)+1;
else
do;
/*If there is no match, then corresponding 't' will be initialized as '1',
and all of the other 't' reset to '0'*/
do j=1 to dim(t);
t(j)=0;
end;
t(i)=1;
end;
rc=h.replace(); /*Update the Hash*/
end;
if last then
rc=h.output(dataset:'want');
run;
Try this:
%macro freq(dsn);
proc sql;
select name into:name separated by ' ' from dictionary.columns where libname='WORK' and memname='HAVE' and name like 't%';
quit;
%let ncol=%sysfunc(countw(&name,%str( )));
%do i=1 %to &ncol;
%let col=%scan(&name,&i);
proc freq data=have;
table &col/out=col_&i(keep=&col count rename=(&col=count count=&col));
run;
%end;
data temp;
merge
%do i=1 %to &ncol;
col_&i
%end;
;
by count;
run;
data want;
set temp;
array vars t:;
do over vars;
if missing(vars) then vars=0;
end;
run;
%mend;
%freq(have)

Append multiple variable from same dataset into single variable

Given the following dataset:
obs var1 var2 var3
1 123 456 .
2 123 . 789
3 . 456 789
How does one go about appending all the variables into a single variable whilst ignoring the empty observations (denoted by ".")?
Desired output:
obs var4
1 123
2 123
3 456
4 456
5 789
6 789
Data step:
data have;
input
var1 var2 var3; cards;
123 456 .
123 . 789
. 456 789
;run;
Not sure why you read the numbers in as char, but if I change to num, it could be done like this:
data have;
input var1 var2 var3;
cards;
123 456 .
123 . 789
. 456 789
;run;
data want (keep=var4);
set have;
var4=var1;if var4 ne . then output;
var4=var2;if var4 ne . then output;
var4=var3;if var4 ne . then output;
run;
OK, let's assume you have a file with the values in it, and you do not know how many variables are in each row. First I need to create a sample text file:
filename x temp;
data _null_;
file x;
put "123 456 . ";
put "123 . 789 ";
put ". 456 789 ";
run;
Then I need to read the first line and count the number of variables:
data _null_;
infile x;
input;
call symputx("number_of_variables",put(countw(_infile_," ","c"),best.));
stop;
run;
%put &number_of_variables;
Now I can dynamically read the variables:
%macro doit();
data have;
infile x;
input
%do i=1 %to &number_of_variables;
var&i
%end;
;
run;
data want (keep=var%eval(&number_of_variables + 1));
set have;
%do i=1 %to &number_of_variables;
var%eval(&number_of_variables + 1)=var&i;
if var%eval(&number_of_variables + 1) ne . then output;
%end;
run;
%mend;
%doit;
You can use proc transpose to do this but there is a trick to doing so. You will need to append a unique identifier to each row, prior to doing the transpose.
I've taken @Stig's sample data and added the observation number to use as a unique identifier:
data have;
input var1 var2 var3;
x = _n_; * ADDING A UNIQUE IDENTIFIER TO EVERY ROW;
cards;
123 456 .
123 . 789
. 456 789
;run;
Then it's simply a case of running proc transpose:
proc transpose data=have out=xx;
by x;
run;
And finally, remove any results where col1 is missing, and add in the observation number:
data want;
obs = _n_;
set xx (keep=col1);
where col1 ne .;
run;
As the order is not important then you can do this in one step, using arrays. As the data step moves through each row, the array enables the variable values to be stored in memory, so you can loop through them. I've set it up so that each time a non-missing value is found, then output it to the new variable.
In creating the array, I've set it to var1--var3, the double dash means all variables between var1 and var3 inclusive. If your real variables are numbered the same way then you can use var1-var3, which means all sequential numbers between the two variables.
data have;
input var1 var2 var3;
datalines;
123 456 .
123 . 789
. 456 789
;
run;
data want;
set have;
array allnums var1--var3;
do i = 1 to dim(allnums);
if not missing(allnums{i}) then do;
var4 = allnums{i};
output;
end;
end;
drop var1--var3 i;
run;