Update the values of a column in a dataset with another table - sas

If I have Table A with two columns: ID and Mean, and Table B with a long list of columns including Mean, how can I replace the values of the Mean column in Table B with the IDs that exist in Table A?
I've tried PROC SQL UPDATE and both the data step MERGE and UPDATE statements, but they keep adding rows when the number of columns is not equal in both tables.

data want;
merge have1(in=H1) have2(in=H2);
by mergevar;
if H1;
run;
That will guarantee that H2 does not add any rows, unless there are duplicate values for one of the by values. Other conditions can be used as well; if h2; would do about the same thing for the right-hand dataset, and if h1 and h2; would only keep records that come from both tables.
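Applied to the tables in the question, the merge could look like the sketch below. It assumes Table B also carries an ID column to match on (the question does not say so explicitly). The sample rows, the output name B_updated, and the helper variable new_mean are made up for illustration. A's Mean is renamed on the way in because, when B has several rows per ID, a same-named variable read from A would be overwritten by B's own value on the second and later rows of each BY group.

```sas
/* Hypothetical sample data: A holds the replacement means, B is the wide table */
data A;
  input ID Mean;
  datalines;
1 10
2 20
;
run;

data B;
  input ID Mean Other $;
  datalines;
1 . x
1 5 y
3 7 z
;
run;

/* MERGE with BY needs both inputs sorted by the BY variable */
proc sort data=A; by ID; run;
proc sort data=B; by ID; run;

data B_updated;
  merge B(in=inB) A(keep=ID Mean rename=(Mean=new_mean));
  by ID;
  if inB;                                         /* keep only rows that come from B */
  if not missing(new_mean) then Mean = new_mean;  /* overwrite only where A has a value */
  drop new_mean;
run;
```

Rows of B whose ID is not in A keep their original Mean, and IDs that exist only in A are not added.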
PROC SQL join should also work fairly easily.
proc sql;
create table want as
select A.id, coalesce(B.mean, A.mean) as mean
from A left join B
on A.id=B.id;
quit;

Related

How to insert records from other tables to a table with more columns in sas

I have two SAS tables, A and B. A has two columns (columna, columnb) and B has four (columna, columnb, columnc, columnd). I want to insert the records from A into B. I tried the following, but it gives me errors:
PROC SQL;
insert into B
select *, columnc='a', columnd='b' from A;
QUIT;
Assuming you just want to leave the extra columns empty then don't include them in the insert. It is much easier to just use SAS code instead of SQL code.
proc append base=b data=a force nowarn;
run;
For the SQL Insert statement you need to specify which columns in the target table you are writing into, otherwise it assumes you will specify values for all of them.
insert into B (columna,columnb)
select columna,columnb
from A
;
If instead you want to fill the extra columns with constants then include the constants in the SELECT list.
insert into B (columna,columnb,columnc,columnd)
select columna,columnb,'a','b'
from A
;
If you are positive that you are providing the values in the right order then you can leave the column names off of the target table specification.
insert into B
select *,'a','b'
from A
;
You can't specify the variable name that way; in fact, you can't name the target variable from the select list of insert into at all, because values are matched to columns by position. See this example:
proc sql;
create table class like sashelp.class;
alter table class
add rownum numeric;
alter table class
add othcol numeric;
insert into class
select *, 1 as othcol, monotonic() as rownum from sashelp.class;
quit;
Here I use as to specify the column name, but notice that it doesn't actually work: it puts 1 in the rownum column, and the monotonic() value in othcol, since they're in that order on the table.
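One way to control the mapping, sketched below, is to name the target columns in the insert into column list, so each selected value goes to the column at the same position in that list rather than following the table's column order. The table name class2 is a placeholder.

```sas
proc sql;
  create table class2 like sashelp.class;
  alter table class2
    add rownum numeric, othcol numeric;
  /* The column list fixes the mapping: 1 goes to othcol, monotonic() to rownum */
  insert into class2 (name, sex, age, height, weight, othcol, rownum)
    select name, sex, age, height, weight, 1, monotonic()
    from sashelp.class;
quit;
```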

Conditional Merge in SAS data step

I have two tables, which for the sake of simplicity are:
Table 1:
Date ID
20170401 X
20170501 Y
20170601 Z
Table 2:
Date ID
20170201 Z
20170301 Y
20170501 X
I want to create a new table which has everything from table 1, unless the ID occurred at a previous date in table 2.
The desired output for table 1 and 2 would be:
Date ID
20170401 X
This is what I currently have. I'm not sure where to put the conditional merge:
data new;
merge table1 table2(in=b);
by date ID;
if not b [where table2.date is before table1.date];
run;
Thanks
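Unfortunately SAS overwrites variables with the same name in a merge, so rename date in table2. Sort both tables by ID, merge on ID, and keep a row only if it is in table1 and has no earlier date in table2. Then drop the date2 column, as it is no longer needed. The check on b matters because a missing date2 compares as less than any date, which would otherwise drop IDs that appear only in table1.
proc sort data=table1; by ID; run;
proc sort data=table2; by ID; run;
data new;
merge table1 (in=a) table2(in=b rename=(date=date2));
by ID;
if a and not (b and date2<date);
drop date2;
run;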
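The same filter can also be written in PROC SQL, which needs no sorting or renaming. This is a sketch of an alternative rather than the answerer's method; it rebuilds the two sample tables from the question and keeps a table1 row unless table2 has the same ID with an earlier date.

```sas
/* Sample tables from the question */
data table1;
  input Date ID $;
  datalines;
20170401 X
20170501 Y
20170601 Z
;
run;

data table2;
  input Date ID $;
  datalines;
20170201 Z
20170301 Y
20170501 X
;
run;

proc sql;
  create table new as
  select t1.*
  from table1 as t1
  where not exists
    (select 1
     from table2 as t2
     where t2.ID = t1.ID
       and t2.Date < t1.Date);   /* an earlier occurrence in table2 excludes the row */
quit;
```

For the sample data this leaves only the row 20170401 X, matching the desired output. Unlike a one-to-one merge, it also copes with several table2 rows per ID.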

Multiple left join with same columns

I have 3 Data sets A,B,C which contains the following variables
A:
period region city
B:
period city Sales
C:
period region Sales
My goal is to do a left join on A using B and C to get the Sales information based on the geographic location. I tried the following sequence of steps:
/* Left joining B to A based on period and city */
proc sql;
Create table merge1 as
select l.* , r.* from
A as l left join B as r
on l.period = r.period and l.city=r.city;
quit;
/* Renaming Sales variable*/
data merged2;
set merge1;
rename Sales= s1;
run;
/*Doing another left join again, this time using C*/
proc sql;
create table merge3 as
select l.*,r.* from
merged2 as l left join C as r
on l.period= r.period and l.region=r.region;
quit;
/*Replacing some of the values*/
data merge4;
set merge3;
Sales1= IFN(s1=., Sales, s1);
drop s1 Sales;
run;
My question is whether there is a better or more efficient way to go about this, especially for the multiple left joins, since the process will get really tedious as the number of datasets and variables to be matched increases. Thanks!
You could do it in a single SQL procedure. Since you have multiple tables, you will have to join them one by one.
proc sql;
Create table merge1 as select
A.* ,
B.sales as s1,
C.sales as s2,
coalesce(B.sales, C.sales) as Sales /*takes first non missing value*/
from A
left join B on (A.period = B.period and A.city = B.city)
left join C on (A.period = C.period and A.region = C.region);
quit;

subset data by thousands of observations

I have two data sets:
Data set 1: This dataset has 2300 rows. The jobID is the same throughout the dataset, but the Hash is unique throughout the dataset
Hash jobID
3456343454 1077
3453454 1077
43673 1077
.... and so on
Data set 2: This dataset has 5838918 rows. Different JobID values and different Hash values such as the following:
Hash jobID
2223422 2
233435 155
2344322 1171
... and so on
What I am trying to do is see whether any of the Hash values in the first dataset also exist in the second dataset. Since there are over a thousand unique Hash values in the first dataset, I cannot type each one of them into a condition like the following:
if hash in (value1 value2...etc). I also want to show all the Hash values that exist in dataset 1 but not in dataset 2.
What is the best way to go about doing this?
Also, Hash is character ($32. format and informat), while JobID is numeric (format BEST12., informat 12.).
Use a SQL subquery to build the list of values; you don't have to type them in manually.
proc sql;
create table in1_not2 as
select *
from table1 as a
where a.hash not in
(select b.hash from table2 as b);
quit;
If I understood correctly, you could check with a simple merge.
Order the two datasets by Hash:
proc sort data=dataset1; by Hash; run;
proc sort data=dataset2; by Hash; run;
Check if the Hash is in both datasets:
data check;
merge dataset1 (in=a keep=Hash)
dataset2 (in=b keep=Hash);
by Hash;
if a and b;
run;
Note that all I'm doing is checking the hash, I'm not bringing any other variable to the final dataset.
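The same merge can also answer the second half of the question, namely which Hash values are in dataset 1 but not in dataset 2. The sketch below uses tiny made-up samples (only the value 43673 is placed in both). The names h1, h2, in_both and in1_not2 are placeholders, and nodupkey is added on the assumption that only distinct Hash values matter.

```sas
/* Made-up samples in the layout described in the question */
data dataset1;
  input Hash :$32. jobID;
  datalines;
3456343454 1077
3453454 1077
43673 1077
;
run;

data dataset2;
  input Hash :$32. jobID;
  datalines;
2223422 2
43673 155
2344322 1171
;
run;

/* Keep one row per Hash so the merge compares distinct values only */
proc sort data=dataset1(keep=Hash) out=h1 nodupkey; by Hash; run;
proc sort data=dataset2(keep=Hash) out=h2 nodupkey; by Hash; run;

data in_both in1_not2;
  merge h1(in=a) h2(in=b);
  by Hash;
  if a and b then output in_both;     /* Hash found in both datasets */
  else if a then output in1_not2;     /* Hash only in dataset 1 */
run;
```

Sorting to separate output datasets leaves the original tables untouched, at the cost of one extra copy of the Hash column from each.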

select only a few columns from a large table in SAS

I have to join 2 tables on a key (say ABC). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
contains some 100 columns. KEY Columns ABC.
TABLE B:
Contains just 2 columns. KEY Column ABC and status_cd
TABLE A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.