Why does my GROUP BY statement in my subquery get ignored and give me duplicates? [duplicate] - sas

I am using PROC SQL in SAS.
When I use a GROUP BY clause in my main query, I get this note in the log:
NOTE: A GROUP BY clause has been discarded because neither the SELECT clause nor the optional HAVING clause of the associated table-expression referenced a summary function.
SELECT
HP.Area
,HP.Name
,HP.NPI
,FACILITIES.ID
,(SELECT COUNT(*) FROM EVAL.CITATIONS C
WHERE C.ID = CITATIONS.ID
) AS Total_Citations
FROM EVAL.HP HP
LEFT JOIN EVAL.FACILITIES FACILITIES
ON FACILITIES.NPI = HP.NPI
LEFT JOIN EVAL.CITATIONS CITATIONS
ON CITATIONS.ID = FACILITIES.ID
GROUP BY CITATIONS.ID
When I run this program, I get duplicate rows, and Total_Citations counts all rows in the CITATIONS table because the GROUP BY clause is being ignored.
Output:
HP.Area  HP.Name    HP.NPI  FACILITIES.ID  Total_Citations
AV       OMG Inc.   1234    001            17026
AV       OMG Inc.   1234    001            17026
AV       Why        1241    512            17026
AV       Why        1241    512            17026
BP       Dis        8305    643            17026
BP       Happening  8221    346            17026
It should look like:
HP.Area  HP.Name    HP.NPI  FACILITIES.ID  Total_Citations
AV       OMG Inc.   1234    001            14
AV       Why        1241    512            0
BP       Dis        8305    643            0
BP       Happening  8221    346            36
The HP table is my main table, and I want to left join the FACILITIES and CITATIONS tables to it. FACILITIES has the unique identifier (NPI) that connects the HP and CITATIONS tables. CITATIONS has one row per citation for every facility in a given time period. I am trying to get the total number of citations from CITATIONS per ID.

I suggest the following as a solution:
proc sql;
SELECT
HP.Area
,HP.Name
,HP.NPI
,HP.ID
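,COUNT(CITATIONS.ID) AS Total_Citations /* counts matched citation rows only, so an ID with no citations gives 0 rather than 1 */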
FROM EVAL.HP
LEFT JOIN EVAL.CITATIONS
ON CITATIONS.ID = HP.ID
GROUP BY HP.Area, HP.Name, HP.NPI, HP.ID;
quit;
Here, the HP table supplies the area, name, and so on, and it is joined directly to the CITATIONS table. Your subquery is not necessary because the join already gives you the number of citations per ID. If you want Area, Name, and ID in your resulting table, you should add these columns to the GROUP BY clause as well.
I could not see the relevance of your third table, but if you need columns from the FACILITIES table, you can join it and select the columns you want, as long as you also put those columns in the GROUP BY clause. Note: if there is more than one entry for an ID in the FACILITIES table, you will get duplicates in your result again.
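To see how the join-and-group approach behaves for IDs with and without citations, here is a minimal sketch that runs the same kind of query in SQLite through Python's sqlite3 module instead of SAS. The table contents are made up for illustration, and the id column on hp follows the answer's assumption that HP carries the facility ID. It also shows that counting a column from the joined table gives 0 for unmatched rows, while COUNT(*) gives 1.

```python
import sqlite3

# Hypothetical miniature versions of HP and CITATIONS (illustrative data only).
# The hp.id column is an assumption taken from the answer, not from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hp (area TEXT, name TEXT, npi INTEGER, id TEXT);
    CREATE TABLE citations (id TEXT);
    INSERT INTO hp VALUES ('AV', 'OMG Inc.', 1234, '001'),
                          ('AV', 'Why',      1241, '512');
    INSERT INTO citations VALUES ('001'), ('001'), ('001');
""")

query = """
    SELECT hp.name,
           COUNT(*)            AS count_star,  -- counts the NULL-extended row too
           COUNT(citations.id) AS count_id     -- skips NULLs from the left join
    FROM hp
    LEFT JOIN citations ON citations.id = hp.id
    GROUP BY hp.area, hp.name, hp.npi, hp.id
    ORDER BY hp.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each HP row appears once because every selected non-aggregate column is in the GROUP BY clause. 'Why' has no citations, so count_id is 0 while count_star is 1.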

Related

Check if value is in another table and add columns in Power BI

I have 2 tables, table1 contains some survey data and table2 is a full list of students involved. I want to check if Name in table2 is also found in table1. If yes, add Age and Level information in table2, otherwise, fill these columns with no data.
table1:
id Name Age Level
32 Anne 13 Secondary school
35 Jimmy 5 Primary school
38 Becky 10 Primary school
40 Anne 13 Secondary school
table2:
id Name
1 Anne
2 Jimmy
3 Becky
4 Jack
Expected output:
id Name Age Level
1 Anne 13 Secondary school
2 Jimmy 5 Primary school
3 Becky 10 Primary school
4 Jack no data no data
Update:
I created a relationship between table1 and table2 using the common column id (which can be repeated in table1).
Then I used:
Column = RELATED(table1[AGE])
but it raised this error:
The column 'table1[AGE]' either doesn't exist or doesn't have a relationship to any table available in the current context.
There are various ways to achieve the desired output, but the simplest I found is to use the RELATED DAX function. If you find this answer helpful, please mark it as the answer.
Create a relationship between table1 and table2 using the 'Name' column.
Create a calculated column in table2 as:
Column = RELATED(table1[AGE])
Repeat the same step for the Level column also.
Column 2 = RELATED(table1[LEVEL])
This will give you a table with ID, Name, Age, and Level for the common names between the two tables.
To fill the blank rows with "no data", create two more calculated columns with the following DAX:
Column 3 = IF(ISBLANK(table2[Column]), "no data", table2[Column])
Column 4 = IF(ISBLANK(table2[Column 2]), "no data", table2[Column 2])
This will give you the desired output.
EDIT:- You can also use the following formula to do the same thing in a single column
Column =
VAR X = RELATED(table1[AGE])
VAR RES = IF(ISBLANK(X), "no data", X)
RETURN
RES
and
Column 2 =
VAR X = RELATED(table1[LEVEL])
VAR RES = IF(ISBLANK(X), "no data", X)
RETURN
RES
This will also give you the same output.
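The lookup-with-default logic described above can be sketched outside Power BI as well. The following is plain Python, not DAX, and is only meant to illustrate the idea: match on Name, take Age and Level when a match exists, and fall back to "no data" otherwise. Keeping the first survey row per name is an assumption made here because Anne appears twice in table1.

```python
# Rows copied from the question's table1 (survey data) and table2 (student list).
table1 = [
    {"id": 32, "Name": "Anne",  "Age": 13, "Level": "Secondary school"},
    {"id": 35, "Name": "Jimmy", "Age": 5,  "Level": "Primary school"},
    {"id": 38, "Name": "Becky", "Age": 10, "Level": "Primary school"},
    {"id": 40, "Name": "Anne",  "Age": 13, "Level": "Secondary school"},
]
table2 = [
    {"id": 1, "Name": "Anne"},
    {"id": 2, "Name": "Jimmy"},
    {"id": 3, "Name": "Becky"},
    {"id": 4, "Name": "Jack"},
]

# Assumption: keep the first survey row per Name so each name maps to one record.
lookup = {}
for row in table1:
    lookup.setdefault(row["Name"], row)

result = []
for student in table2:
    match = lookup.get(student["Name"])  # None when the name is not in table1
    result.append({
        "id": student["id"],
        "Name": student["Name"],
        "Age": match["Age"] if match else "no data",
        "Level": match["Level"] if match else "no data",
    })

for row in result:
    print(row)
```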

Include 'all' column when limiting in where statement

Sorry if this has been answered, I've searched but had a really hard time getting anything close to right. So, in proc tabulate, I keep running into an issue where I want to be able to create tables that have a Total column, but it's obviously a little more complicated than that. For example, let's say I need to make a table that has the appropriate statistic columns for Arizona participants, and then the stat columns for all participants. If I limit the where statement to be where State = Arizona, obviously the total column (using All) will only actually include Arizona participants, which is not what I want. A workaround for smaller number of tables is to make one table that's not limited, and then one that is limited, and copy and paste, but that's not really something I want to do when I have 90 sets of tables, one set for each state.
The only thing that comes to my mind is creation of some sort of dummy variable, but I'm not sure how to go about that.
EDIT:
Desired table (in this particular case I'm searching for help on, I guess it's not a column, but if the solution ends up only working for a column I could probably restructure my table). I ultimately want to have it make one file for each state, and in each file each of the questions is broken down individually, showing the All-States total and the State Total. I have a macro set up to do that.
Consider a SQL solution where you use a derived table subquery to include the All data aggregation in new column(s). Specifically, you will cross join the All query with the State query as no join keys are used. And since the total aggregate query yields only scalar values, it will repeat for every row.
Example Data
* ID Participant Score State
* 1 Angela Andrews 415 Arizona
* 2 Brandon Baker 813 Arizona
* 3 Charlene Clark 323 Arizona
* 4 David Douglas 689 Illinois
* 5 Erin Ellis 501 Illinois
* 6 Frank Fillmore 739 Illinois
SAS Code
Note: The aggregate functions applied to the All columns in the outer query are interchangeable, because each one operates on a single repeated value, so Max(Val) = Min(Val) = Mean(Val) ... A variety is used here only for illustration:
proc sql;
CREATE TABLE newdata AS
SELECT data.State,
COUNT(data.Score) As StateCount,
SUM(data.Score) As StateTotal,
MEAN(data.Score) As StateMean,
MEDIAN(data.Score) As StateMedian,
STD(data.Score) As StateSteDev,
VAR(data.Score) As StateVariance,
MAX(total.AllCount) As AllCount,
MIN(total.AllTotal) As AllTotal,
MEAN(total.AllMean) As AllMean,
MEDIAN(total.AllMedian) As AllMedian,
MAX(total.AllSteDev) As AllSteDev,
AVG(total.AllVariance) As AllVariance
FROM data,
(SELECT COUNT(data.Score) As AllCount,
SUM(data.Score) As AllTotal,
MEAN(data.Score) As AllMean,
MEDIAN(data.Score) As AllMedian,
STD(data.Score) As AllSteDev,
VAR(data.Score) As AllVariance
FROM data sub) As total
GROUP BY data.State;
quit;
You can also limit the outer query with a WHERE clause, for example: WHERE data.State = 'Arizona'
Output
Obs State StateCount StateTotal StateMean StateMedian StateSteDev StateVariance AllCount AllTotal AllMean AllMedian AllSteDev AllVariance
1 Arizona 3 1551 517 415 260.438 67828 6 3480 580 595 195.431 38193.2
2 Illinois 3 1929 643 689 125.491 15748 6 3480 580 595 195.431 38193.2
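As a rough check of the cross-join idea, the sketch below runs a reduced version of the query in SQLite through Python's sqlite3 module, using the example data from above. It is not SAS, and it keeps only COUNT, SUM, and AVG because SQLite has no built-in MEDIAN, STD, or VAR aggregates. The lower-case column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER, participant TEXT, score INTEGER, state TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        (1, "Angela Andrews", 415, "Arizona"),
        (2, "Brandon Baker", 813, "Arizona"),
        (3, "Charlene Clark", 323, "Arizona"),
        (4, "David Douglas", 689, "Illinois"),
        (5, "Erin Ellis", 501, "Illinois"),
        (6, "Frank Fillmore", 739, "Illinois"),
    ],
)

# The derived table returns one row of overall aggregates; cross joining it
# repeats that row for every record, so any aggregate recovers the same value.
query = """
    SELECT d.state,
           COUNT(d.score)   AS state_count,
           SUM(d.score)     AS state_total,
           AVG(d.score)     AS state_mean,
           MAX(t.all_count) AS all_count,
           MAX(t.all_total) AS all_total,
           MAX(t.all_mean)  AS all_mean
    FROM data AS d
    CROSS JOIN (SELECT COUNT(score) AS all_count,
                       SUM(score)   AS all_total,
                       AVG(score)   AS all_mean
                FROM data) AS t
    GROUP BY d.state
    ORDER BY d.state
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Adding WHERE d.state = 'Arizona' to the outer query would restrict the state columns to Arizona while the derived table still aggregates over every state.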

SAS for the following scenario [duplicate]

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 23 10 11
2 22 11 14
1 19 14 15
2 34 6 17
3 10 11 5
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 23 14 15
2 34 11 17
3 10 11 5
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the maximum (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = max(23,19) = 23.
I have one solution, which is very clumsy. I simply sort copies of the data set D1 three times (for example, by ID and ATR1) and remove duplicates. I later merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks
PROC SQL METHOD
PROC SQL;
CREATE TABLE D2 AS
SELECT ID,
MAX(ATR1) as ATR1,
MAX(ATR2) as ATR2,
MAX(ATR3) as ATR3
FROM D1
GROUP BY ID;
QUIT;
The GROUP BY clause can also be written GROUP BY 1, omitting ID, as this refers to the 1st column in the SELECT clause.
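The grouped-maximum query can be tried on the question's sample data without SAS. The sketch below uses SQLite through Python's sqlite3 module, so it only demonstrates the SQL logic and not PROC SQL itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D1 (ID INTEGER, ATR1 INTEGER, ATR2 INTEGER, ATR3 INTEGER)")
# Rows copied from data set D1 in the question.
conn.executemany(
    "INSERT INTO D1 VALUES (?, ?, ?, ?)",
    [
        (1, 23, 10, 11),
        (2, 22, 11, 14),
        (1, 19, 14, 15),
        (2, 34, 6, 17),
        (3, 10, 11, 5),
    ],
)

# One output row per ID, each column holding that ID's maximum.
query = """
    SELECT ID,
           MAX(ATR1) AS ATR1,
           MAX(ATR2) AS ATR2,
           MAX(ATR3) AS ATR3
    FROM D1
    GROUP BY ID
    ORDER BY ID
"""
d2 = conn.execute(query).fetchall()
for row in d2:
    print(row)
```

The result matches the D2 table requested in the question: each maximum is taken per column, so the values in one output row can come from different input rows.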
PROC SUMMARY METHOD
PROC SUMMARY DATA=D1 NWAY;
CLASS ID;
VAR ATR1 ATR2 ATR3;
OUTPUT OUT=D2 (DROP=_TYPE_ _FREQ_) MAX()=;
RUN;
Here's an explanation of some of the options:
NWAY - gives only the maximum level of summarisation, here it's not as important because you have only one CLASS variable, meaning there is only one level of summarisation. However, without NWAY you get an extra row showing the max value of ATR1-ATR3 across the whole dataset, which is not something you asked for in your question.
DROP=_TYPE_ _FREQ_ - This removes the automatic variables:
_TYPE_ - which shows the level of summarisation (see comment above), which would just be a column containing the value 1.
_FREQ_ - gives a frequency count of the ID values, which although useful, isn't something you wanted in your question.

Calculating average weight loss [closed]

I have created a small data of weight loss
   ID    Name    Team    Before  After  Loss
1  1011  David   red     125     112    13
2  1024  Alice   red     145     135    10
3  1036  Alan    yellow  180     156    24
4  1039  Ashley  red     145     130    15
5  1019  Diana   yellow  128     109    19
How do I calculate the average loss as well as team wise average loss?
as simple as:
proc sql;
/*average loss*/
select mean(loss) as avgLoss from table;
/*team average loss*/
select team, mean(loss) as avgLoss from table group by 1;
quit;
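For a quick check of both queries on the question's five rows, here is a minimal sketch using SQLite through Python's sqlite3 module. The table name weightloss is a placeholder chosen for this example, and AVG stands in for the SAS MEAN function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "weightloss" is a hypothetical table name used only in this sketch.
conn.execute(
    "CREATE TABLE weightloss "
    "(id INTEGER, name TEXT, team TEXT, before INTEGER, after INTEGER, loss INTEGER)"
)
conn.executemany(
    "INSERT INTO weightloss VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1011, "David", "red", 125, 112, 13),
        (1024, "Alice", "red", 145, 135, 10),
        (1036, "Alan", "yellow", 180, 156, 24),
        (1039, "Ashley", "red", 145, 130, 15),
        (1019, "Diana", "yellow", 128, 109, 19),
    ],
)

# Overall average loss: a single aggregate over the whole table.
overall = conn.execute("SELECT AVG(loss) FROM weightloss").fetchone()[0]

# Team-wise average loss: the same aggregate, grouped by team.
team_avgs = dict(
    conn.execute("SELECT team, AVG(loss) FROM weightloss GROUP BY team").fetchall()
)

print("overall:", overall)
print("by team:", team_avgs)
```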
Use Proc means. This is the default output when using the class statement. The OP doesn't indicate if they want a table or report.
Proc means data=have;
Class team;
Var loss;
Run;
This produces all base statistics at overall and team level. To get only the average, add the keyword mean to the proc statement.
Proc means data=have mean;
Proc summary works in much the same way as proc means...
proc summary data=table;
class team;
var loss;
output out = summrydat
mean = avgloss;
run;
In the output dataset, the first line (having _TYPE_ = 0) gives the total average, whereas subsequent lines (having _TYPE_ = 1) give grouped averages.

SQLite C++ Compare two tables within the same database for matching records

I want to be able to compare two tables within the same SQLite Database using a C++ interface for matching records. Here are my two tables
Table name : temptrigrams
ID TEMPTRIGRAM
---------- ----------
1 The cat ran
2 Compare two tables
3 Alex went home
4 Mark sat down
5 this database blows
6 data with a
7 table disco ninja
(+78 more rows)
Table Name: spamtrigrams
ID TRIGRAM
---------- ----------
1 Sam's nice ham
2 Tuesday was cold
3 Alex stood up
4 Mark passed out
5 this database is
6 date with a
7 disco stew pot
(+10000 more rows)
The first table has two columns and 85 records and the second table has two columns with 10007 records.
I would like to compare the records in the TEMPTRIGRAM column of the first table against the TRIGRAM column of the second table and return the number of matches across the tables. So if ID 1, 'The cat ran', appears in 'spamtrigrams', I would like that counted, with the total returned at the end as an integer.
Could somebody please explain the syntax for the query to perform this action?
Thank you.
This is a join query with an aggregation. My guess is that you want the number of matches per trigram:
select t1.temptrigram, count(t2.trigram)
from temptrigrams t1 left outer join
spamtrigrams t2
on t1.temptrigram = t2.trigram
group by t1.temptrigram;
If you just want the number of matches:
select count(t2.trigram)
from temptrigrams t1 join
spamtrigrams t2
on t1.temptrigram = t2.trigram;
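The question asks for a C++ interface, but the SQL text itself does not depend on the host language. As a sketch only, the snippet below runs both queries through Python's sqlite3 module against the question's table names. The rows are hypothetical and were chosen so that two trigrams match, since none of the sample rows shown in the question match exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temptrigrams (id INTEGER PRIMARY KEY, temptrigram TEXT);
    CREATE TABLE spamtrigrams (id INTEGER PRIMARY KEY, trigram TEXT);
    -- Hypothetical rows for illustration only.
    INSERT INTO temptrigrams (temptrigram) VALUES
        ('The cat ran'), ('Alex went home'), ('data with a');
    INSERT INTO spamtrigrams (trigram) VALUES
        ('Alex went home'), ('date with a'), ('The cat ran');
""")

# Matches per trigram: the left join keeps unmatched trigrams with a count of 0.
per_trigram = conn.execute("""
    SELECT t1.temptrigram, COUNT(t2.trigram)
    FROM temptrigrams t1
    LEFT OUTER JOIN spamtrigrams t2 ON t1.temptrigram = t2.trigram
    GROUP BY t1.temptrigram
    ORDER BY t1.temptrigram
""").fetchall()

# Total number of matches as a single integer.
total_matches = conn.execute("""
    SELECT COUNT(t2.trigram)
    FROM temptrigrams t1
    JOIN spamtrigrams t2 ON t1.temptrigram = t2.trigram
""").fetchone()[0]

print(per_trigram)
print(total_matches)
```

The comparison is an exact, case-sensitive string match under SQLite's default collation, so 'The Cat Ran' would not match 'The cat ran' unless the columns or the comparison use a case-insensitive collation such as NOCASE.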