I want to use proc compare to update a dataset on a daily basis.
work.HAVE1
Date Key Var1 Var2
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
work.HAVE2
Date Key Var1 Var2
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
Date and Key uniquely determine one record.
How can I use the above two tables to construct the following?
work.WANT
Date Key Var1 Var2
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
I don't want to delete the previous data and then rebuild it. I want to modify it by appending new records at the bottom and adjusting the values in VAR1 or VAR2.
I'm struggling with proc compare but it just doesn't return what I want.
proc compare base=work.HAVE1 compare=work.HAVE2 out=WORK.DIFF outnoequal outcomp;
id Date Key;
run;
This will give you the new and changed (unequal) records in a single dataset, WORK.DIFF. You'll have to distinguish new vs. changed records yourself.
However, what you want to achieve is actually a MERGE (insert new records, overwrite existing ones), though perhaps for performance reasons you don't want to re-create the full table.
data work.WANT;
merge work.HAVE1 work.HAVE2;
by Date Key;
run;
Edit1:
/* the outdif option will produce records with _type_ = 'DIF' for matched keys */
proc compare base=work.HAVE1 compare=work.HAVE2 out=WORK.RESULT outnoequal outcomp outdif;
id Date Key;
run;
data WORK.DIFF_KEYS; /* keys of changed records */
set WORK.RESULT;
where _type_ = 'DIF';
keep Date Key;
run;
/* split NEW and CHANGED */
data
WORK.NEW
WORK.CHANGED
;
merge
WORK.RESULT (where=( _type_ ne 'DIF'))
WORK.DIFF_KEYS (in = d)
;
by Date Key;
if d then output WORK.CHANGED;
else output WORK.NEW;
run;
Edit2:
Now you can just APPEND WORK.NEW to the target table.
For WORK.CHANGED, use either a MODIFY or an UPDATE statement to update the records.
Depending on the size of the changes, you can also think about a PROC SQL DELETE to remove the old records and PROC APPEND to add the new values.
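For illustration, here is a rough sketch of the MODIFY route; the target table name WORK.TARGET and the index name IDX are invented for this example, and it assumes Date and Key really do identify a unique row:
/* one-off: create a composite index so that KEY= lookups work */
proc datasets library=work nolist;
modify TARGET;
index create IDX = (Date Key) / unique;
quit;
/* apply the changed records in place */
data WORK.TARGET;
set WORK.CHANGED (rename=(Var1=new_Var1 Var2=new_Var2));
modify WORK.TARGET key=IDX; /* look up the matching TARGET row via the index */
if _iorc_ = 0 then do; /* key found: copy the new values over and rewrite in place */
Var1 = new_Var1;
Var2 = new_Var2;
replace;
end;
else _error_ = 0; /* key not found: clear the error flag and skip */
run;
/* add the genuinely new keys at the bottom */
proc append base=WORK.TARGET data=WORK.NEW;
run;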
All PROC COMPARE will do is tell you the differences between the two datasets. To achieve your goal you need to use an UPDATE statement in a data step. This way, values in HAVE1 are updated with HAVE2 where the Date and Key match, or a new record is inserted if there is no match.
data have1;
input Date :date9. Key $ Var1 $ Var2;
format date date9.;
datalines;
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
;
run;
data have2;
input Date :date9. Key $ Var1 $ Var2;
format date date9.;
datalines;
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
;
run;
data want;
update have1 have2;
by date key;
run;
I have a SAS dataset with 2 columns that I want to compare (VAR1 and VAR2). I would like to check, for each value of VAR1, whether that value exists anywhere in the column VAR2. If the VAR1 value does not exist anywhere in column VAR2, I want to flag it as 1.
For example, I have this:
TABLE in
VAR1  VAR2
k3    t7
t7    g7
p8    k3
...   ...
And I would want this:
TABLE out
VAR1  VAR2  FLAG
k3    t7    0
t7    g7    0
p8    k3    1
...   ...   ...
I tried using
FLAG = ifn(indexw(VAR2,VAR1),0,1);
But this method only compares the two columns within the current row.
Thank you in advance for your help!
Edit: I tried running this code as suggested by Joe but ran into an error.
Code:
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
data for_fmt;
set your_table;
fmtname = 'VAR2F';
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'o';
start = .;
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set your_table;
flag = put(var1,var2f.);
run;
Error:
ERROR: This range is repeated, or values overlap: .- ..
In SAS, everything is based on one row at a time in the data step, so you can't do what you're looking for directly.
What you can do, though, is use a lookup technique - there are quite a few - and that will let you get what you're after.
The easiest one to use in your case is probably a format.
data for_fmt;
set your_table;
fmtname = '$VAR2F'; *character format, since VAR2 is a character variable;
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'o'; *this is for "other" (not found) records;
start = ' ';
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set your_table;
flag = put(var1,$var2f.);
run;
This is pretty fast (only limited by dataset read/write time) unless you have millions of unique rows.
You could also merge the dataset to itself, or do this in SQL, or use a hash table, but the format approach is probably simplest.
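If you do go the hash route, here is a rough sketch of what that could look like for this example (same your_table and the same flag logic as above; the hash object name h and the temporary key variable are just illustrative):
data want;
if _n_ = 1 then do;
/* load every VAR2 value into an in-memory hash table once */
declare hash h(dataset:'your_table (rename=(var2=key))');
h.defineKey('key');
h.defineDone();
end;
length key $2;
set your_table;
/* flag = 1 when VAR1 is not found among the VAR2 values */
flag = (h.find(key:var1) ne 0);
drop key;
run;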
As @Joe says, the data step always works on one row at a time. So if you can gather all the values of VAR2 into one place (a temporary array here), you can do the character matching easily.
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
data want;
array _char_[&sysnobs.] $200. _temporary_;
do until(eof1);
set your_table end=eof1;
i+1;
_char_[i]=var2;
end;
do until(eof2);
set your_table end=eof2;
flag=1;
do i=1 to dim(_char_) until(flag=0);
if var1=_char_[i] then flag=0;
end;
output;
end;
run;
Here's the method I typically use in situations like this. I first create a list of the values to check against, then merge with that and can easily pick out the ones that are found.
proc sort data=have (keep=var2) out=var2levels (rename=(var2=var1)) nodupkey; by var2;
proc sort data=have; by var1;
data want;
merge have (in=in1) var2levels (in=in2);
by var1;
if in1;
flag = in2;
run;
So here the first proc sort creates a list of all the unique values of var2. The output data set renames that to var1 for merging purposes (this can be done more clearly but less efficiently by renaming multiple variables). Then we simply merge the original data set (keeping all records) with the list of existing var2 values and set the flag accordingly.
I want to merge two tables, but they have two columns in common, and I do not want the value of var1 in A replaced by that in B. If we don't use drop or rename, does anyone know how?
I can fix it with SQL, but I'm just curious about MERGE!
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Merge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
but what I'd like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename
data c;
merge a (in=N) b(in=M rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use rename / drop etc., then you could just flip the merge order so that the dataset whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned, it does so in the order that they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two datasets and the BY variable values match, the record from the first is loaded and then the record from the second is loaded, overwriting the values read from the first.
For 1 to 1 matching you can just change the order that the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A then add a SET statement that the compiler will process but that can never execute before your MERGE statement.
if 0 then set a b ;
If you are doing one-to-many matching then you might have other trouble, since when a dataset stops contributing values to the current BY group SAS does not re-read its last observation. In that case you will have to use some combination of the RENAME=, DROP= or KEEP= dataset options, as in the sketch below.
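For instance, a quick sketch of the DROP= route for the one-to-many data in this question (it does use DROP=, which the question wanted to avoid, but it is the simplest illustration of the point: A's var1 survives because B's copy is never read, while B's var2 is still carried through each BY group):
data c;
merge a (in=N)
b (in=M drop=var1); /* B's var1 is never read, so A's value survives */
by id1;
if N;
run;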
In PROC SQL, when you have duplicate names for selected columns (and are creating an output dataset instead of a report), SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.
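A small sketch of that PROC SQL behaviour using the same A and B tables (the copy of var1 that survives is A's, because A is listed first):
proc sql;
create table c as
select a.*, b.*
from a left join b
on a.id1 = b.id1;
/* the log notes that id1 and var1 already exist on WORK.C,
   and the surviving values for those columns come from A */
quit;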
I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
by ID Value;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
by ID Value;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also, I swapped want and have, assuming you create what you want from what you have, and I assume have is sorted by ID and Value.
Also, this will only flag the duplicate 1 D above, not 3 Z.
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need to keep track of whether any ID is ever different from the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;
I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First you sort the data, then add a count variable that supplies the 1 values. Next you use PROC TRANSPOSE to flip the data from long to wide. Then fill in the missing values with a 0. This is a fully dynamic method; it doesn't matter how many different Injury types you have or how many records per person. There are other methods that are probably shorter code, but I think this is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a SAS dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $24.;
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!
I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:
/* create table A */
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ var1 var2;
datalines;
1 20 30
2 20 30
;
run;
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a b;
by id1;
run;
Table C looks like this:
ID1 ID2 VAR1 VAR2
1 a 20 30
1 b 10 30
2 a 20 30
2 b 10 30
Why didn't the 10s in rows 2 and 4 get replaced by the 20s from table B, while var2 works as expected?
I know I can do this simply using PROC SQL, and that's what I did to solve the problem. But I'm still quite curious whether there is a way to do what I wanted using merge, and why this wasn't working. I prefer merge over SQL in this circumstance because the logic is easier to implement (until I found it not working properly).
I use SAS 9.4.
This has to do with how SAS iterates over the data sets during the merge. Basically, the second record for each ID1 value in A doesn't get lined up with a record from B. The value of VAR2 is carried over from the previous record, while VAR1 gets its value from A (because there is no B record to overwrite it).
IF there is a record in B for EVERY ID1, then you can rewrite your merge like this to achieve what you want.
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a(drop=var1) b;
by id1;
run;
This drops the VAR1 from A so that it is carried down from the record in B.
Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).
Like DomPazz suggests, proc sql is the way to do this. merge will only keep one value from each data set. The coalesce function picks the first non-missing value from the list, so it uses var1 from b, but if b.var1 is null then it uses a.var1.
proc sql;
create table c as
select
a.id1,
a.id2,
coalesce(b.var1,a.var1) as var1,
b.var2
from
a
left join b
on a.id1 = b.id1
;
quit;
The merge method could still work fine; you would just need to be more explicit about how to choose the 'best' value for var1, such as:
data c (drop = a_var1 b_var1);
merge a(rename=(var1 = a_var1))
b(rename=(var1 = b_var1));
by id1;
* Now you have two different variables named a_var1 and b_var1;
* Implement logic to choose your favorite;
if NOT MISSING(b_var1) Then DO;
var1 = b_var1;
var1_source='B';
END;
else DO;
var1 = a_var1;
var1_source='A';
END;
run;
If your criterion for which 'var1' to choose is as simple as 'if b exists, use it', then this is identical to the SQL method with coalesce().
Where I've found this method useful is for more complicated criteria; plus, it's always nice to know the source of the data (which you don't get with coalesce()).