I have the below two datasets
Dataset A
id age mark
1 . .
2 . .
1 . .
Dataset B
id age mark
2 20 200
1 10 100
I need the below dataset as output
Output Dataset
id age mark
1 10 100
2 20 200
1 10 100
How can I do this without PROC SQL, i.e. using a DATA step?
There are many ways to do this. The easiest is to sort the two data sets and then use MERGE. For example:
proc sort data=A;
by id;
run;
proc sort data=B;
by id;
run;
data WANT;
merge A(drop=age mark) B;
by ID;
run;
The trick is to drop the variables you are adding from the first data set A; the new variables will come from the second data set B.
Of course, this solution does not preserve the original order of the observations in your data set, and it only works because your second data set contains unique values of id.
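If the original row order of A matters, one possible workaround is to record each row's position before sorting and restore it afterwards. This is only a sketch: the variable name seq and the intermediate data set names A_seq and B_sorted are assumptions made for illustration, and it still relies on id being unique in B.

```sas
/* Remember the original position of every row in A */
data A_seq;
  set A(drop=age mark);
  seq = _n_;
run;

proc sort data=A_seq;
  by id;
run;

proc sort data=B out=B_sorted;
  by id;
run;

/* Bring in age and mark from B, keeping only ids that occur in A */
data WANT;
  merge A_seq(in=inA) B_sorted;
  by id;
  if inA;
run;

/* Restore the original order of A and drop the helper variable */
proc sort data=WANT out=WANT(drop=seq);
  by seq;
run;
```

With the sample data above, WANT should come back in the order 1, 2, 1 shown in the requested output.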
I tried this and it worked for me, even if you have data you would like to preserve in that column. Just for completeness' sake, I added an SQL variant too.
data a;
input id a;
datalines;
1 10
2 20
;
data b;
input id a;
datalines;
1 .
1 5
1 .
2 .
3 4
;
data c (drop=b);
merge a (rename = (a=b) in=ina) b (in = inb);
by id;
if b ne . then a = b;
run;
proc sql;
create table d as
select a.id, a.a from a right join b on a.id=b.id where a.id is not null
union all
select b.id, b.a from a right join b on a.id = b.id where a.id is null
;
quit;
I want to merge two tables that have 2 columns in common, and I do not want the value of var1 in A to be replaced by the one in B. Does anyone know how to do this without using drop or rename?
I can fix it with SQL, but I'm just curious about MERGE!
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Merge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
But what I would like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename, and drop id2 from B so that it cannot overwrite the id2 values read from A:
data c(drop=var1_2);
merge a (in=N) b(in=M drop=id2 rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use rename / drop etc., then you could just flip the merge order so that the dataset whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned, it does so in the order in which they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two datasets and the BY variable values match, the record from the first is loaded and then the record from the second is loaded, overwriting the values read from the first for any variables they have in common.
For 1 to 1 matching you can just change the order that the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A, then add a SET statement before your MERGE statement that the compiler will process but that can never execute.
if 0 then set a b ;
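Put together with the flipped merge order, a complete step might look like the sketch below. It assumes the tables a and b created in the question above; the condition 0 is never true, so the SET statement only shapes the variable list at compile time.

```sas
data c;
  /* Never executes: only defines the variables in the order of A, then B */
  if 0 then set a b;
  /* B is read first, so the values from A overwrite B's var1 */
  merge b(in=M) a(in=N);
  by id1;
  if N;
run;
```

The variables in c should then appear as id1, id2, var1, var2, even though b is listed first on the MERGE statement.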
If you are doing a 1-to-many match then you might have other trouble: once a dataset stops contributing observations to the current BY group, SAS does not re-read its last observation, so any of its values that were overwritten are not restored. In that case you will have to use some combination of the RENAME=, DROP= or KEEP= dataset options.
In PROC SQL when you have duplicate names for selected columns (and are trying to create an output dataset instead of report) then SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.
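To illustrate that PROC SQL behaviour, here is a minimal sketch using the same tables a and b; the output name c_sql is an assumption. SAS should write a warning to the log that id1, id2 and var1 already exist, and keep the copies that come from a.

```sas
proc sql;
  /* a.* is listed first, so its id1, id2 and var1 are the ones kept */
  create table c_sql as
  select a.*, b.*
  from a left join b
    on a.id1 = b.id1;
quit;
```

Because the first copy wins here, var1 keeps the values from a, and only var2 is added from b.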
data a1
col1 col2 flag
a 2 .
b 3 .
a 4 .
c 1 .
In data set a1, flag is always missing. I want to update multiple rows using a2.
data a2
col1 flag
a 1
Ideal output:
col1 col2 flag
a 2 1
b 3 .
a 4 1
c 1 .
But the following doesn't update all the records in each BY group:
data a1;
modify a1 a2;
by col1;
run;
Question edited
Actually a1 is a very large data set on server. Hence I prefer to modify it (if possible) instead of creating a new one. Otherwise I have to drop previous a1 first and copy a new a1 from local to server, which will take much more time.
If you want to do this with MODIFY, you have to loop over the dataset being modified in some fashion, or it will only replace the first row. The reason is that the other dataset runs out of records: normally this behaves like merge, where once it finds a match it advances to the next record. Here's one option; there are others.
data a1(index=(col1));
input col1 $ col2 flag;
datalines;
a 2 .
b 3 .
a 4 .
c 1 .
;;;;
run;
data a2(index=(col1));
col1='a';
flag=1;
run;
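data a1;
set a2(rename=(flag=flag2)); /* RENAME= takes a parenthesized list */
do _n_ = 1 to nobs_a1;
/* repeated keyed reads with the same col1 value step through the matching rows */
modify a1 key=col1 nobs=nobs_a1;
if _iorc_=0 then do;
flag=flag2;
replace;
end;
end;
/* clear the error flag that is set when a keyed read finds no match */
if _iorc_=%sysrc(_DSENOM) then _error_=0;
run;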
If you are avoiding the MERGE statement because of the sorting it requires, you can simply change your merging approach.
If flag in A1 is always missing, you can drop it; otherwise you should temporarily rename it so that you do not lose that information.
Here I will merge A1 and A2 using a hash object. This approach doesn't require any prior sorting of the datasets.
data final_merged(drop = finder);
length flag 8; /* change to the real length; use $ if flag is a character variable */
if _N_ = 1 then do;
declare hash merger(dataset:'A2');
merger.definekey('col1');
merger.DefineData ('flag');
merger.definedone();
end;
set A1(drop=flag);
finder = merger.find();
if finder ne 0 then flag = .;
/*then flag='' or then flag='unknown' as you want if flag is a character var*/
run;
Please, let me know if this will help.
You could do the following, but PROC SQL does not guarantee that the observations keep their original order, so I'm not sure how useful this would be for you. (You could always preprocess with ordvar=_n_; and then order the SQL query by it if that helps.)
Data:
data a1 ;
input col1 $ col2 flag ;
cards ;
a 2 .
b 3 .
a 4 .
c 1 .
;run ;
data a2 ;
input col1 $ flag ;
cards ;
a 1
;run ;
Merge:
proc sql ;
create table output as
select a.col1, a.col2, b.flag
from a1 a
left join
a2 b
on a.col1=b.col1
;quit ;
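The ordering idea mentioned above could be sketched as follows. The helper variable ordvar comes from that suggestion; the intermediate data set name a1_ord is an assumption made for illustration.

```sas
/* Record the original position of each row in a1 */
data a1_ord ;
  set a1 ;
  ordvar = _n_ ;
run ;

proc sql ;
  /* ordvar is used for ordering only and dropped from the output table */
  create table output(drop=ordvar) as
  select a.col1, a.col2, b.flag, a.ordvar
  from a1_ord a
  left join a2 b
    on a.col1 = b.col1
  order by a.ordvar
  ;
quit ;
```

This keeps the rows of a1 in their original sequence while still taking flag from a2.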
To try and do it in one pass, how about creating two macro variables containing the mapping from a2?
proc sql ;
select distinct col1, flag
into :colvals separated by '', :flagvals separated by ''
from a2
;quit ;
Then set flag from the character at the matching position in the two macro variables (this relies on each col1 value and each flag value being a single character):
data a1 ;
set a1 ;
if findc("&colvals",col1) then
flag=input(substr("&flagvals", findc("&colvals",col1),1),8.) ;
run ;
I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:
/* create table A */
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ var1 var2;
datalines;
1 20 30
2 20 30
;
run;
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a b;
by id1;
run;
Table C looks like this:
ID1 ID2 VAR1 VAR2
1 a 20 30
1 b 10 30
2 a 20 30
2 b 10 30
Why didn't the 10s in rows 2 and 4 get replaced by 20 from table B, while var2 works as expected?
I know I can do this simply using PROC SQL, and that's what I did to solve the problem. But I am still quite curious whether there is a way to do what I wanted using merge, and why this wasn't working. I prefer merge over SQL in this circumstance because the logic is easier to implement (until I found this not working properly).
I use SAS 9.4.
This has to do with how SAS iterates over the data sets during the merge. Within each BY group, only the first record of A lines up with a record from B; the second record of A has no new record from B to read. VAR2 keeps the value carried over from the previous record, while VAR1 is read again from A, which replaces the value that came from B.
If there is a record in B for every ID1, then you can rewrite your merge like this to achieve what you want.
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a(drop=var1) b;
by id1;
run;
This drops the VAR1 from A so that it is carried down from the record in B.
Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).
As DomPazz suggests, PROC SQL is the way to do this. In a MERGE, each data set contributes a value only when one of its observations is read, so B's single var1 cannot overwrite every matching row of A. The coalesce function picks the first non-missing value from its argument list, so it uses var1 from b, and falls back to a.var1 when b.var1 is null.
proc sql;
create table c as
select
a.id1,
a.id2,
coalesce(b.var1,a.var1) as var1,
b.var2
from
a
left join b
on a.id1 = b.id1
;
quit;
The merge method could still work fine, you would just need to be more explicit about how to choose the 'best' value for var1, such as:
data c (drop = a_var1 b_var1);
merge a(rename=(var1 = a_var1))
b(rename=(var1 = b_var1));
by id1;
* Now you have two different variables named a_var1 and b_var1;
* Implement logic to choose your favorite;
if NOT MISSING(b_var1) Then DO;
var1 = b_var1;
var1_source='B';
END;
else DO;
var1 = a_var1;
var1_source='A';
END;
run;
If your criterion for which 'var1' to choose is as simple as 'if b exists, use it', then this is identical to the SQL method with coalesce().
Where I've found this method useful is for more complicated criteria, plus it's always nice to know the source of the data (which doesn't happen with coalesce()).
I know how to count group and subgroup numbers through proc freq or sql. My question is: when some factor in the subgroup is missing, I still want to show the missing factor as 0. How can I do that? For example,
the data set is:
group1 group2
1 A
1 A
1 A
1 A
2 A
2 B
2 B
I want a result as:
group1 group2 N
1 A 4
1 B 0
2 A 1
2 B 2
If I only use the default SAS setting, it will usually show as
group1 group2 N
1 A 4
2 A 1
2 B 2
But I still want the second line in the result to tell me that there are 0 observations in that category.
Use the SPARSE option within proc freq. Consider it a cross join between all options from GROUP1 and GROUP2.
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc freq data=have;
table group1*group2/out=want sparse;
run;
proc print data=want;
run;
Reeza's sparse option works as long as each group is represented in your data at least once. Suppose there were a group1 value of 3 that is not represented in your data, and you still wanted it to show up in the frequency table. In that case, the solution is to create a reference table with all of your categories and then right join your frequency table to it.
Create a reference table:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
Create the frequency table with proc sql, right joining to the reference table:
proc sql;
select
r.group1,
r.group2,
count(h.group1) as freq
from
have h
right join ref r
on h.group1 = r.group1
and h.group2 = r.group2
group by
r.group1,
r.group2
order by
r.group1,
r.group2
;
quit;
Another option, a cross between DWal's point about values that aren't in the data and Reeza's One Proc, One Solution, is proc tabulate. With preloadfmt and printmiss, it works as long as the format contains all possible values, even ones that don't appear in the data.
proc format;
value groupformat
1='Group 1'
2='Group 2'
3='Group 3'
;
run;
data have;
input group1 group2 $;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc tabulate data=have;
class group1 group2/preloadfmt;
format group1 groupformat.;
tables group1*group2,n/printmiss misstext='0';
run;
Here is how to do this via proc summary, using DWal's reference table to specify which combinations of values to use:
data ref;
do group1 = 1 to 3;
group2 = 'A';
output;
group2 = 'B';
output;
end;
run;
data have;
input group1 group2 $1.;
cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
proc summary nway data = have classdata=ref;
class group1 group2;
output out = summary (drop = _TYPE_);
run;
N.B. I had to tweak the have dataset slightly to make sure that group2 has length 1 in both datasets. If you use variables with the same name but different lengths in your classdata= and data= datasets, SAS will complain.
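One way to avoid the length mismatch described above is to declare the lengths explicitly in both data sets. This is a minimal sketch, assuming group2 only ever holds a single character; the LENGTH statements are an addition and not part of the original answer.

```sas
data ref;
  length group1 8 group2 $1;   /* same lengths as in HAVE */
  do group1 = 1 to 3;
    do group2 = 'A', 'B';
      output;
    end;
  end;
run;

data have;
  length group1 8 group2 $1;   /* same lengths as in REF */
  input group1 group2 $;
  cards;
1 A
1 A
1 A
1 A
2 A
2 B
2 B
;
run;
```

With matching lengths in place, the proc summary step with classdata=ref can be run unchanged.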
I have a dataset similar to the one below
ID A B C D E
1 1 . . . .
1 . . 1 . .
1 . 1 . . .
2 . 1 . . .
2 . . . 1 .
3 . . . . 1
3 1 . . . .
4 . . 1 . .
5 . 1 . . .
I want to condense the data into one row for each ID. So the dataset would look like the one below.
ID A B C D E
1 1 1 1 . .
2 . 1 . 1 .
3 1 . . . 1
4 . . 1 . .
5 . 1 . . .
Well I created another table and removed the duplicate ID's. So I have two tables--A and B. I then tried merging the two datasets together. I was playing around with following SAS code.
data C;
merge A B;
by ID;
run;
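Here's a neat trick I picked up from another forum. There's no need to split up the original dataset: on the UPDATE statement, the first reference (have with obs=0) only supplies the structure, and the second supplies the values. Because UPDATE does not overwrite existing values with missing ones, and the BY statement ensures you only get 1 record per ID, each ID collapses into a single row.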
data have;
infile datalines dsd;
input ID A B C D E;
datalines;
1,1,,,,,
1,,,1,,,
1,,1,,,,
2,,1,,,,
2,,,,1,,
3,,,,,1,
3,1,,,,,
4,,,1,,,
5,,1,,,
;
run;
data want;
update have (obs=0) have;
by id;
run;
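This could be solved using the retain statement. The input data set must be sorted by id.
data B(rename=(A2=A B2=B C2=C D2=D E2=E));
set A;
by id;
retain A2 B2 C2 D2 E2;
/* reset the carried values at the start of each ID */
if first.id then do;
A2 = .;
B2 = .;
C2 = .;
D2 = .;
E2 = .;
end;
/* keep every non-missing value seen within the ID */
if A ne . then A2=A;
if B ne . then B2=B;
if C ne . then C2=C;
if D ne . then D2=D;
if E ne . then E2=E;
if last.id then output;
drop A B C D E;
run;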
There are other ways to solve this, but hopefully this is helpful.
PROC MEANS is a great tool for something like this. PROC SQL would also give you a reasonable solution, but MEANS is often faster.
proc means data=yourdata;
var a b c d e;
class id;
types id; *to avoid the 'overall' row;
output out=want(drop=_type_ _freq_) max=; *output the maximum of each var for each ID, use SUM= instead if you want more than 1. Writing to WANT avoids overwriting the input data set;
run;
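For comparison, the PROC SQL alternative mentioned above might look like the sketch below. It assumes the input data set is named have, as in the UPDATE example earlier in this thread; the output name want_sql is hypothetical.

```sas
proc sql;
  /* MAX ignores missing values, so each flag survives if it was set on any row of the ID */
  create table want_sql as
  select id,
         max(a) as a,
         max(b) as b,
         max(c) as c,
         max(d) as d,
         max(e) as e
  from have
  group by id;
quit;
```

As with the PROC MEANS version, SUM could be used in place of MAX if you want to count how often each flag occurs.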