Selecting Max Value from Left Join

I have two tables like below.
Profile : ID
Charac : ID, NAME, DATE
With the above tables, I am trying to get the NAME from Charac for the row with the maximum DATE for each ID.
I am trying to do the join with proc sql by adapting an answer written for MySQL, as below:
proc sql;
create table ggg as
select profile.ID ,T2.NAME
from Profile
left join
( select ID,max(DATE) as max_DATE
from EDW.CHARAC
group by ID
) as T1
on fff.ID = EDW.ID
left join EDW.CHARAC as T2
on T2.ID = T1.max_DATE
order by profile.ID DESC;
quit;
Error
ERROR: Unresolved reference to table/correlation name EDW.
ERROR: Expression using equals (=) has components that are of different data types.

Could it be that you intended
on T2.ID = T1.max_DATE
(which is probably the source of the "components that are of different data types" error, since it compares an ID with a date)
to be:
on T2.ID = T1.ID and T2.DATE = T1.max_DATE
that is, joining on the IDs at the maximum DATE?

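You can't use EDW like that: in EDW.CHARAC it is a libref, not a table alias, so EDW.ID does not refer to anything. You need to join to the subquery through its alias T1:
on fff.ID=T1.ID
(here fff stands for the left-hand table, which is called Profile in the query as posted).
As for the data types, that is probably because EDW.ID is undefined and thus numeric by default.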
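The two corrections above combine into one query. The join logic is plain SQL, so it can be checked outside SAS. Below is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the question, while the sample rows and the alias p are made up for illustration. This is not SAS code and does not reproduce SAS-specific behavior such as librefs.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Profile (ID INTEGER);
    CREATE TABLE Charac (ID INTEGER, NAME TEXT, DATE TEXT);
    INSERT INTO Profile VALUES (1), (2), (3);
    INSERT INTO Charac VALUES
        (1, 'old',  '2020-01-01'),
        (1, 'new',  '2021-06-30'),
        (2, 'only', '2019-03-15');
""")

rows = con.execute("""
    SELECT p.ID, T2.NAME
    FROM Profile AS p
    LEFT JOIN (
        -- latest date per ID
        SELECT ID, MAX(DATE) AS max_DATE
        FROM Charac
        GROUP BY ID
    ) AS T1
        ON p.ID = T1.ID
    LEFT JOIN Charac AS T2
        -- match on the ID and on the latest date, not ID against date
        ON T2.ID = T1.ID AND T2.DATE = T1.max_DATE
    ORDER BY p.ID DESC
""").fetchall()

print(rows)
```

Profile rows with no match in Charac (ID 3 in this sample) are kept with a missing NAME, which is the point of using left joins here.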

Related

Joining Together Tables in PROC SQL

I want to join two tables that I created into one table, but I am getting an error that says "Column OverallStudentReport.ID was found in more than one table in the same scope." Any help fixing this error, or a better way to join these two tables into one, would be appreciated.
The code below created my first table
PROC SQL;
Create table SemesterReport1 as select coalesce(A.ID,B.ID,C.ID,D.ID,E.ID) as ID,
coalesce(A.Year,B.Year,C.Year,D.Year,E.Year) as Year, coalesce(A.Term,B.Term,C.Term,D.Term,E.Term) as Term,
SemesterGPA.SemGPA, AccumulativeGPA.GPAAccum,
CreditHoursEarnedSemester.CreditHoursEarnedSemester,
GradedCreditHoursEarnedSemester.GradedCreditHoursEarnedSemester,
ClassStanding.ClassStanding
from SemesterGPA as A
full join AccumulativeGPA as B on A.ID=B.ID and A.Year=B.Year and A.Term=B.Term
full join CreditHoursEarnedSemester as C on A.ID=C.ID and A.Year=C.Year and A.Term=C.Term
full join GradedCreditHoursEarnedSemester as D on A.ID=D.ID and A.Year=D.Year and A.Term=D.Term
full join ClassStanding as E on A.ID=E.ID and A.Year=E.Year and A.Term=E.Term
order by ID, Year, Term
;
quit;
The code below created my second table
PROC SQL;
Create table OverallStudentReport as select coalesce(A.ID,B.ID,C.ID,D.ID,E.ID) as ID,
OverallGPA.TotalGPA,
OverallCreditHoursEarned.OverallCreditHoursEarned,
OverallGradedCreditHoursEarned.OverallGradedCreditHoursEarned,
RepeatClasses.RepeatClasses,
GradeCounts.ACount,GradeCounts.BCount,GradeCounts.CCount,GradeCounts.DCount,
GradeCounts.ECount, GradeCounts.WCount
from OverallGPA as A
full join OverallCreditHoursEarned as B on A.ID=B.ID
full join OverallGradedCreditHoursEarned as C on A.ID=C.ID
full join RepeatClasses as D on A.ID=D.ID
full join GradeCounts as E on A.ID=E.ID
order by ID
;
quit;
and the code below is supposed to join the two tables created above but there is a syntax error.
PROC SQL;
Create table Report1 as select *
from SemesterReport1, OverallStudentReport
full join
OverallStudentReport
on SemesterReport1.ID=OverallStudentReport.ID
order by ID
;
quit;
Here is my log
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
72
73 PROC SQL;
74 Create table Report1 as select *
75 from SemesterReport1, OverallStudentReport
76 full join
77 OverallStudentReport
78 on SemesterReport1.ID=OverallStudentReport.ID
79 order by ID
80 ;
ERROR: Column OverallStudentReport.ID was found in more than one table in the same scope.
WARNING: Column named ID is duplicated in a select expression (or a view). Explicit references to it will be to the first one.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
81 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):

When you assign table aliases, use them consistently throughout the query, not just selectively in SELECT and JOIN. Also, the fields in ORDER BY are ambiguous. Since ORDER BY needs the columns computed in SELECT, refer to them with the calculated keyword.
By the way, see Bad Habits to Kick: Using table aliases like (a, b, c) or (t1, t2, t3). Instead, use more informative shorthand aliases that align with the original table names. Consider the following adjustments:
PROC SQL;
create table SemesterReport1 as
select coalesce(s.ID, a.ID, ch.ID, g.ID, cs.ID) as Final_ID
, coalesce(s.Year, a.Year, ch.Year, g.Year, cs.Year) as Final_Year
, coalesce(s.Term, a.Term, ch.Term, g.Term, cs.Term) as Final_Term
, s.SemGPA
, a.GPAAccum
, ch.CreditHoursEarnedSemester
, g.GradedCreditHoursEarnedSemester
, cs.ClassStanding
from SemesterGPA as s
full join AccumulativeGPA as a
on s.ID = a.ID
and s.Year = a.Year
and s.Term = a.Term
full join CreditHoursEarnedSemester as ch
on s.ID = ch.ID
and s.Year = ch.Year
and s.Term = ch.Term
full join GradedCreditHoursEarnedSemester as g
on s.ID = g.ID
and s.Year = g.Year
and s.Term = g.Term
full join ClassStanding as cs
on s.ID = cs.ID
and s.Year = cs.Year
and s.Term = cs.Term
order by calculated Final_ID
, calculated Final_Year
, calculated Final_Term;
quit;
PROC SQL;
create table OverallStudentReport as
select coalesce(og.ID, och.ID, ogch.ID, r.ID, gc.ID) as Final_ID
, og.TotalGPA
, och.OverallCreditHoursEarned
, ogch.OverallGradedCreditHoursEarned
, r.RepeatClasses
, gc.ACount
, gc.BCount
, gc.CCount
, gc.DCount
, gc.ECount
, gc.WCount
from OverallGPA as og
full join OverallCreditHoursEarned as och
on og.ID = och.ID
full join OverallGradedCreditHoursEarned as ogch
on og.ID = ogch.ID
full join RepeatClasses as r
on og.ID = r.ID
full join GradeCounts as gc
on og.ID = gc.ID
order by calculated Final_ID;
quit;
Then, in the final query, do not repeat the table OverallStudentReport, and qualify the ID (here Final_ID) in ORDER BY. Also see another habit to kick: Why is SELECT * considered harmful?
PROC SQL;
create table Report1 as
select smr.Final_ID as ID
, smr.Final_Year as Year
, smr.Final_Term as Term
, smr.SemGPA
, smr.GPAAccum
, smr.CreditHoursEarnedSemester
, smr.GradedCreditHoursEarnedSemester
, smr.ClassStanding
, osr.Final_ID
, osr.TotalGPA
, osr.OverallCreditHoursEarned
, osr.OverallGradedCreditHoursEarned
, osr.RepeatClasses
, osr.ACount
, osr.BCount
, osr.CCount
, osr.DCount
, osr.ECount
, osr.WCount
from SemesterReport1 smr
full join OverallStudentReport osr
on smr.Final_ID = osr.Final_ID
order by smr.Final_ID ;
quit;

How to prevent left join from returning multiple rows

I am using a left join in SAS where the right-side table has duplicate IDs with different donations, so the join returns several rows per ID.
I only want one row per ID, the one with the highest donated amount.
The code is as follows:
Create table x
As select T1.*,
T2. Donations
From xxx t1
Left join yy t2 on (t1.id = t2.id);
Quit;
Thanks for any help

In SAS, follow https://stackoverflow.com/a/61486331/8227346.
In MySQL (8.0 or later, which supports window functions) you can use partitioning with ROW_NUMBER. Note that each derived table needs an alias, and that rank is a reserved word in MySQL 8.0, so the row number is named rn here:
CREATE TABLE x AS
SELECT t1.*, t2.Donations
FROM xxx t1
LEFT JOIN
(
SELECT * FROM
(
SELECT
yy.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY Donations DESC) AS rn
FROM
yy
) ranked
WHERE
rn = 1
)
t2
ON (t1.id = t2.id);
More info can be found at https://www.c-sharpcorner.com/blogs/rownumber-function-with-partition-by-clause-in-sql-server1

You can either use a subselect that picks only the highest donation for each ID, or do some preparatory work in SAS first (which I prefer):
*Order ascending by ID and DONATIONS;
proc sort data=work.t2;
by ID DONATIONS;
run;
*only retain the dataset with the highest DONATION per ID;
data work.HIGHEST_DONATIONS;
set work.t2;
by ID;
if last.ID then output;
run;
I don't have SAS available right now but it should work.
Don't hesitate asking further questions. :)
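The subselect option mentioned above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than SAS. The names xxx, yy, id and Donations come from the question, while the name column and the sample rows are assumptions made up for the example.

```python
import sqlite3

# Hypothetical sample data: ID 1 has three donations, ID 3 has none.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE xxx (id INTEGER, name TEXT);
    CREATE TABLE yy (id INTEGER, Donations REAL);
    INSERT INTO xxx VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO yy VALUES (1, 10.0), (1, 50.0), (1, 20.0), (2, 5.0);
""")

rows = con.execute("""
    SELECT t1.id, t1.name, t2.Donations
    FROM xxx AS t1
    LEFT JOIN (
        -- collapse the right-hand table to one row per id before joining
        SELECT id, MAX(Donations) AS Donations
        FROM yy
        GROUP BY id
    ) AS t2
        ON t1.id = t2.id
    ORDER BY t1.id
""").fetchall()

print(rows)
```

Aggregating before the join is enough when only the maximum amount is needed. If other columns from the winning row are required as well, a ROW_NUMBER or sort-and-keep-last approach like the ones above is the better fit.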

Comparing two date variables in SAS in a proc sql WHERE clause

I am using SAS Enterprise guide and want to compare two date variables:
My code looks as follows:
proc sql;
CREATE TABLE observations_last_month AS
SELECT del_flag_1,
gross_exposure_fx,
reporting_date format=date7.,
max(reporting_date) AS max_date format=date7.
FROM &dataIn.
WHERE reporting_date = max_date;
quit;
If I run my code without the WHERE clause, the query runs and returns data.
However, when I run the code above I get the following error messages:
ERROR: Expression using (=) has components that are of different data types.
ERROR: The following tables were not found in the contributing tables: max_date.
What am I doing wrong here? Thanks up front for the help

If you want to subset based on an aggregate function then you need to use HAVING instead of WHERE. If you want to refer to a variable that you have derived in your query then you need to use the CALCULATED keyword (or just re-calculate it).
proc sql;
CREATE TABLE observations_last_month AS
SELECT del_flag_1
, gross_exposure_fx
, reporting_date format=date7.
, max(reporting_date) AS max_date format=date7.
FROM &dataIn.
HAVING reporting_date = CALCULATED max_date
;
quit;
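The "just re-calculate it" alternative mentioned above amounts to computing the maximum in a scalar subquery, which also works in SQL engines that do not merge an aggregate back onto each row the way PROC SQL does. A minimal sketch using Python's built-in sqlite3 module follows. The column names come from the question, while the table name observations and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data; dates are stored as ISO strings so MAX() sorts them correctly.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE observations (
        del_flag_1 INTEGER,
        gross_exposure_fx REAL,
        reporting_date TEXT
    );
    INSERT INTO observations VALUES
        (0, 100.0, '2017-01-31'),
        (1, 250.0, '2017-02-28'),
        (0,  75.0, '2017-02-28');
""")

rows = con.execute("""
    SELECT del_flag_1, gross_exposure_fx, reporting_date
    FROM observations
    -- re-calculate the maximum instead of referring to a derived column
    WHERE reporting_date = (SELECT MAX(reporting_date) FROM observations)
    ORDER BY gross_exposure_fx
""").fetchall()

print(rows)
```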

select only a few columns from a large table in SAS

I have to join 2 tables on a key (say XYZ). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
Contains some 100 columns. Key column: ABC.
TABLE B:
Contains just 2 columns: the key column ABC and status_cd.
Table A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.

You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;

I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd = coalesce(status_cd,
(select b.status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
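The pattern above (a correlated subquery inside coalesce, guarded by exists) can be tried out on toy data. This is a minimal sketch using Python's built-in sqlite3 module, not SAS. The key ABC and the column status_cd come from the question. The transaction table is renamed trans here because TRANSACTION is a keyword in SQLite, and the sample rows are made up.

```python
import sqlite3

# Hypothetical sample data: key 3 exists only in master, key 2 already has a status.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE master (ABC INTEGER, status_cd TEXT);
    CREATE TABLE trans (ABC INTEGER, status_cd TEXT);
    INSERT INTO master VALUES (1, NULL), (2, 'open'), (3, NULL);
    INSERT INTO trans VALUES (1, 'closed'), (2, 'closed');
""")

con.execute("""
    UPDATE master
    SET status_cd = COALESCE(
        status_cd,
        -- correlated lookup of the replacement value by key
        (SELECT t.status_cd FROM trans AS t WHERE t.ABC = master.ABC)
    )
    -- only touch rows whose key appears in the transaction table
    WHERE EXISTS (SELECT 1 FROM trans AS t WHERE t.ABC = master.ABC)
""")

rows = con.execute("SELECT ABC, status_cd FROM master ORDER BY ABC").fetchall()
print(rows)
```

The existing value for key 2 is preserved, the missing value for key 1 is filled in, and key 3 is left untouched. As in the SAS version, this assumes the key is unique in the transaction table, otherwise the scalar subquery is ambiguous.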
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.

Concatenate two SAS data sets, but only if ID appear in both

I want to concatenate two SAS data sets, one from 2003 and one from 2013. There is a unique identifier in both, and I only want records to be concatenated if their ID appears in both data sets.
NB: there are multiple records with the same ID.

Here's some untested code:
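proc sql;
create table want as
select * from (
/* rows from t1 whose id also appears in t2 */
select * from t1 where t1.id in (select t2.id from t2)
/* union all keeps duplicate rows; a plain union would drop them */
union all
/* rows from t2 whose id also appears in t1 */
select * from t2 where t2.id in (select t1.id from t1)) as A;
quit;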
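The filtering logic of the query above can be checked on toy data. This is a minimal sketch using Python's built-in sqlite3 module, not SAS. The table names t1 and t2 and the column id come from the answer, while the columns yr and val and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data: id 1 is in both tables, id 2 only in t1, id 3 only in t2.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (id INTEGER, yr INTEGER, val TEXT);
    CREATE TABLE t2 (id INTEGER, yr INTEGER, val TEXT);
    INSERT INTO t1 VALUES (1, 2003, 'a'), (1, 2003, 'b'), (2, 2003, 'c');
    INSERT INTO t2 VALUES (1, 2013, 'd'), (3, 2013, 'e');
""")

rows = con.execute("""
    SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)
    UNION ALL
    SELECT * FROM t2 WHERE id IN (SELECT id FROM t1)
    ORDER BY yr, val
""").fetchall()

print(rows)
```

Both 2003 records for id 1 survive alongside the 2013 record, and ids that occur in only one table are dropped. This positional stacking assumes both tables have the same columns in the same order.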
