We have a system written in Django to track patients recruited to clinical trials.
Spreadsheets are used to record the number of patients recruited each month throughout a financial year, so a sheet only contains 12 months of data even though a study may run for years.
There is a table in a Django database into which the spreadsheets are imported each month. The data includes the month/year, a count of patients, and some other fields. Each import includes all the previous months' data; we need this to make sure no data has been changed on the import sheet since the last import.
For example, the import table containing two imports (the first up to January and the second up to February) would look like this:
id | study_id | data_date | patient_count | [other fields] -->
100 5456 2016-04-01 10 ...
101 5456 2016-05-01 8 ...
102 5456 2016-06-01 5 ...
... all months in between ...
109 5456 2016-01-01 12 ...
110 5456 2016-02-01 NULL ...
111 5456 2016-03-01 NULL ...
112 5456 2016-04-01 10 ...
113 5456 2016-05-01 8 ...
114 5456 2016-06-01 5 ...
... all months in between ...
121 5456 2016-01-01 12 ...
122 5456 2016-02-01 6 ...
123 5456 2016-03-01 NULL ...
The other fields include a foreign key to another table containing the actual study identification number (iras_number), so I have to join to that to select the rows for a particular study.
I want the most recent values of data_date and patient_count for a study, which may span more than one financial year, so I tried this query (iras_number is passed to the function performing this query):
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(import_study__iras_number=iras_number) \
.annotate(max_id=Max('id')).order_by()
However, this produces a SQL query which includes patient_count in the GROUP BY, resulting in duplicate rows:
data_date | patient_count | max_id
2016-04-01 10 100
2016-04-01 10 112
2016-05-01 8 101
2016-05-01 8 113
...
2016-01-01 12 109
2016-01-01 12 121
2016-02-01 NULL 110
2016-02-01 6 122
How do I select the most recent data_date and patient_count from the table using the ORM?
If I were writing the SQL I would do an inner select of the max(id) grouped by data_date and then use that to join, or use an IN query, to select the fields I require from the table; such as:
SELECT data_date, patient_count
FROM importstudydata
WHERE id IN (
SELECT MAX(id) AS "max_id"
FROM importstudydata INNER JOIN importstudy
ON importstudydata.import_study_id = importstudy.id
WHERE importstudy.iras_number = 5456
GROUP BY importstudydata.data_date
)
ORDER BY data_date ASC
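To check that this SQL does what is described, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The table and column names follow the query above; the sample rows are a subset of the example import table, and the importstudy row with id 1 is an assumption made only for the illustration.

```python
import sqlite3


def latest_counts(conn, iras_number):
    """Return (data_date, patient_count) taken from the newest row per data_date."""
    sql = """
        SELECT data_date, patient_count
        FROM importstudydata
        WHERE id IN (
            SELECT MAX(d.id)
            FROM importstudydata AS d
            INNER JOIN importstudy AS s ON d.import_study_id = s.id
            WHERE s.iras_number = ?
            GROUP BY d.data_date
        )
        ORDER BY data_date ASC
    """
    return conn.execute(sql, (iras_number,)).fetchall()


# Assumed schema: only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE importstudy (id INTEGER PRIMARY KEY, iras_number INTEGER);
    CREATE TABLE importstudydata (
        id INTEGER PRIMARY KEY,
        import_study_id INTEGER REFERENCES importstudy(id),
        data_date TEXT,
        patient_count INTEGER
    );
    INSERT INTO importstudy VALUES (1, 5456);
    -- first import
    INSERT INTO importstudydata VALUES
        (100, 1, '2016-04-01', 10),
        (101, 1, '2016-05-01', 8),
        (109, 1, '2016-01-01', 12),
        (110, 1, '2016-02-01', NULL);
    -- second import: same months again, February now filled in
    INSERT INTO importstudydata VALUES
        (112, 1, '2016-04-01', 10),
        (113, 1, '2016-05-01', 8),
        (121, 1, '2016-01-01', 12),
        (122, 1, '2016-02-01', 6);
""")

for row in latest_counts(conn, 5456):
    print(row)
```

Each data_date appears once, and February shows the value from the second import (id 122) rather than the NULL from the first.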
I've tried to create an inner select to replicate the SQL query, but the inner select returns more than one field (column), which causes the query to fail:
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=ImportStudyData.objects.values('data_date') \
.filter(import_study__iras_number=iras_number) \
.annotate(max_data_id=Max('id')))
Now I can't get the inner select to return only the max(id) grouped by data_date, and for it to be performed in a single SQL query.
For now I'm splitting the query into a number of steps to get the result I want.
First I query for the most recent id of all rows related to the study
id_qry = ImportStudyData.objects.values('data_date')\
.filter(import_study__iras_number=iras_number)\
.annotate(max_id=Max('id'))
To get a list of just the ids, stripping out the date, I use a list comprehension:
id_list = [x['max_id'] for x in id_qry]
This list is then used as a filter for the final query to get the number of patients
totals = ImportStudyData.objects.values('data_date', 'patient_count') \
.filter(id__in=id_list)
It hits the database twice, and is computationally more expensive, but for now it works and I need to move on.
I'll come back to this problem at a later date.
Use distinct():
totals = ImportStudyData.objects.values('data_date', 'patient_count').filter(import_study__iras_number=iras_number).annotate(max_id=Max('id')).order_by('data_date').distinct()
Related
My problem relates to a Power BI report.
This is an example table; the real table contains 10,000+ rows.
user  salary  date
1     123     14-10-2022
2     455     11-10-2022
3     333     13-10-2022
4     222     12-10-2022
5     111     10-10-2022
desired output:
user  salary  date        salary (date-1 day)  salary (date-3 days)
1     123     14-10-2022  333                  455
2     455     11-10-2022  111
3     333     13-10-2022  222                  111
4     222     12-10-2022  455
5     111     10-10-2022
How can I achieve this in Power BI?
I tried merging tables, but the dashboard became very slow with that approach.
I would do this in DAX, not Power Query.
Add a date dimension (search on this if you aren’t familiar with date dimensions), then create these DAX measures.
Salary (date-1 day) = CALCULATE( SUM(TABLE[salary]), DATEADD(DateDim[DateKey],-1,day) )
Salary (date-3 day) = CALCULATE( SUM(TABLE[salary]), DATEADD(DateDim[DateKey],-3,day) )
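To make the intended result concrete, the following Python sketch reproduces the lookup outside Power BI: it sums salaries per date, then fetches the total for the date 1 and 3 days earlier. It is only an illustration of the logic, not of how DAX evaluates the measures; the function name add_lagged_salaries and the tuple layout are assumptions.

```python
from datetime import datetime, timedelta

# Example rows from the question: (user, salary, date as DD-MM-YYYY).
rows = [
    (1, 123, "14-10-2022"),
    (2, 455, "11-10-2022"),
    (3, 333, "13-10-2022"),
    (4, 222, "12-10-2022"),
    (5, 111, "10-10-2022"),
]


def parse_day(text):
    """Parse a DD-MM-YYYY string into a date."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def add_lagged_salaries(rows, lags=(1, 3)):
    """Append, for each row, the salary total found `lag` days earlier (or None)."""
    salary_by_day = {}
    for _user, salary, date_text in rows:
        day = parse_day(date_text)
        salary_by_day[day] = salary_by_day.get(day, 0) + salary

    result = []
    for user, salary, date_text in rows:
        day = parse_day(date_text)
        lagged = tuple(salary_by_day.get(day - timedelta(days=n)) for n in lags)
        result.append((user, salary, date_text) + lagged)
    return result


for row in add_lagged_salaries(rows):
    print(row)
```

Dates with no earlier data yield None, which corresponds to the blank cells in the desired output.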
My raw data stops at sales - looking for some DAX help adding the last two as calculated columns.
customer_id order_id order_date sales total_sales_by_customer total_sales_customer_rank
------------- ---------- ------------ ------- ------------------------- ---------------------------
BM 1 9/2/2014 476 550 1
BM 2 10/27/2016 25 550 1
BM 3 9/30/2014 49 550 1
RA 4 12/18/2017 47 525 3
RA 5 9/7/2017 478 525 3
RS 6 7/5/2015 5 5 other
JH 7 5/12/2017 6 6 other
AG 8 9/7/2015 7 7 other
SP 9 5/19/2017 26 546 2
SP 10 8/16/2015 520 546 2
Let's start with total sales by customer:
total_sales_by_customer =
var custID = orders[customer_id]
return CALCULATE(SUM(orders[sales]), FILTER(orders, custID = orders[customer_id]))
First we store the current row's customer_id in custID, then filter the orders table on that ID and sum the sales per customer.
Next the ranking:
total_sales_customer_rank =
var rankMe = RANKX(orders, orders[total_sales_by_customer],,,Dense)
return if (rankMe > 3, "other", CONVERT(rankMe, STRING))
We rank each row by the customer's total sales (taken from the first column); if the rank is greater than 3, we replace it with "other".
On your first question: DAX is not like a programming language. Each row is assessed individually. Let's go with your first row: your custID will be "BM".
Next we calculate the sum of all the sales. We filter the whole table on the custID and sum the result, so the filter actually contains only 3 rows!
This is repeated for each row. That sounds slow, but I only describe it this way so you can understand the result you are getting back; in reality there is clever logic to return the data fast.
What you want to do, "Orders[Customer ID]=Orders[Customer ID]", is not possible because both sides refer to Orders[Customer ID] inside the filter and are evaluated against the same row.
var custid = VALUES(Orders[Customer ID]): VALUES returns a single-column table, so you cannot use it in a filter because you would be comparing a cell value with a table.
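The row-by-row evaluation described above can be mimicked in plain Python, which may help when checking the expected values. This is a sketch of the logic only, using the sample orders from the question; the function name add_total_and_rank and the top_n parameter are made up for the illustration.

```python
from collections import defaultdict

# (customer_id, order_id, sales) from the question's sample data.
orders = [
    ("BM", 1, 476), ("BM", 2, 25), ("BM", 3, 49),
    ("RA", 4, 47), ("RA", 5, 478),
    ("RS", 6, 5), ("JH", 7, 6), ("AG", 8, 7),
    ("SP", 9, 26), ("SP", 10, 520),
]


def add_total_and_rank(orders, top_n=3):
    """Append total sales per customer and a dense rank label to every order row."""
    totals = defaultdict(int)
    for customer, _order_id, sales in orders:
        totals[customer] += sales

    # Dense rank: equal totals share a rank and no rank numbers are skipped.
    distinct_totals = sorted(set(totals.values()), reverse=True)
    dense_rank = {total: i for i, total in enumerate(distinct_totals, start=1)}

    result = []
    for customer, order_id, sales in orders:
        total = totals[customer]
        rank = dense_rank[total]
        label = str(rank) if rank <= top_n else "other"
        result.append((customer, order_id, sales, total, label))
    return result


for row in add_total_and_rank(orders):
    print(row)
```

Every order row of a customer carries the same total and the same rank label, matching the two extra columns in the question's table.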
I am currently trying to create a report that shows how customers behave over time, but instead of doing this by date, I am doing it by customer age (number of months since they first became a customer). So using a date field isn't really an option, considering one customer may have started in Dec 2016 and another starts in Jun 2017.
What I'm trying to find is the month-over-month change in units purchased. If I was using a date field, I know that I could use
[Previous Month Total] = CALCULATE(SUM([Total Units]), PREVIOUSMONTH([FiscalDate]))
I also thought about using EARLIER() to find out but I don't think it would work in this case, as it requires row context that I'm not sure I could create. Below is a simplified version of the table that I'll be using.
ID Date Age Units
219 6/1/2017 0 10
219 7/1/2017 1 5
219 8/1/2017 2 4
219 9/1/2017 3 12
342 12/1/2016 0 500
342 1/1/2017 1 280
342 2/1/2017 2 325
342 3/1/2017 3 200
342 4/1/2017 4 250
342 5/1/2017 5 255
How about something like this?
PrevTotal =
VAR CurrAge = SELECTEDVALUE(Table3[Age])
RETURN CALCULATE(SUM(Table3[Units]), ALL(Table3[Date]), Table3[Age] = CurrAge - 1)
The CurrAge variable gives the Age evaluated in the current filter context. You then plug that into a filter in the CALCULATE line.
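As a cross-check of the "Age minus one" lookup, here is a small Python sketch over the sample table. It assumes the lookup is done per customer ID (as it would be when ID is part of the filter context) and is not a translation of how DAX evaluates the measure; the function name month_over_month is hypothetical.

```python
# (ID, Age, Units) from the simplified table in the question.
rows = [
    (219, 0, 10), (219, 1, 5), (219, 2, 4), (219, 3, 12),
    (342, 0, 500), (342, 1, 280), (342, 2, 325),
    (342, 3, 200), (342, 4, 250), (342, 5, 255),
]


def month_over_month(rows):
    """Return (id, age, units, previous_units, change) for every row."""
    units_by_key = {(cid, age): units for cid, age, units in rows}
    result = []
    for cid, age, units in rows:
        # Units for the same customer at the previous age, if that row exists.
        previous = units_by_key.get((cid, age - 1))
        change = None if previous is None else units - previous
        result.append((cid, age, units, previous, change))
    return result


for row in month_over_month(rows):
    print(row)
```

Rows at Age 0 have no previous value, so both the previous total and the change are None.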
I've been trying to produce a result where multiple queries return progressively more restrictive results. How can I see the full list as well as those records that meet the more restrictive conditions? Query 1 returns 538 records of sites in the given counties.
SELECT E_SITES.ID "SITE ID",
E_SITES.NAME "SITE NAME",
E_SITES.ADDR_1 "SITE ADDRESS",
E_SITES.CITY_NAME || ', ' || E_SITES.STATE_CODE || ' ' || E_SITES.POSTAL_CODE,
E_SITES.COUNTY_NAME
FROM E_SITES
WHERE E_SITES.COUNTY_NAME IN ('ALLAMAKEE', 'BENTON', 'BLACK HAWK', 'BREMER', 'BUCHANAN', 'CHICKASAW', 'CLAYTON', 'DELAWARE', 'DUBUQUE')
ORDER BY E_SITES.ID
Query 2 returns the number of sites that have a contact person identified. This is 503 records.
SELECT E_SITES.ID "SITE ID",
E_SITES.NAME "SITE NAME",
E_SITES.ADDR_1 "SITE ADDRESS",
E_SITES.CITY_NAME || ', ' || E_SITES.STATE_CODE || ' ' || E_SITES.POSTAL_CODE,
E_SITES.COUNTY_NAME,
E_INDIVIDUALS.FIRST_NAME || ' ' || E_INDIVIDUALS.LAST_NAME
FROM E_SITES, E_AFFILIATIONS, E_INDIVIDUALS
WHERE E_SITES.SITE_ID = E_AFFILIATIONS.SITE_ID
AND E_AFFILIATIONS.INDIVIDUAL_RID = E_INDIVIDUALS.RID
AND E_AFFILIATIONS.AFFILIATION_TYPE = ('SITE_CONTACT')
AND E_SITES.COUNTY_NAME IN ('ALLAMAKEE', 'BENTON', 'BLACK HAWK', 'BREMER', 'BUCHANAN', 'CHICKASAW', 'CLAYTON', 'DELAWARE', 'DUBUQUE')
ORDER BY E_SITES.ID
A further query would return those sites with a mailing address, which reduces the results down to 486 records. I need to get all 538 records, whether or not they have a contact or mailing address, and for those that do, have one row for each site.
Additional Information
My current results can look like this for Query 1 (including column headers for clarity, quotes to distinguish data elements):
"SITE ID" "SITE NAME" "SITE ADDRESS" "CITY, STATE ZIP" "COUNTY_NAME"
"09698" "BODINE ELECTRIC" "18114 KAPP DR" "PEOSTA, IA 52067" "BREMER"
"16895" "BRUGGEMAN LUMBER" "3003 WILLOW RD" "HOPKINTON, IA 52237" "DELAWARE"
"40047" "GENEVIEVE, LLC" "707 LINCOLN ST" "GARNAVILLOR, IA 52052" "CLAYTON"
Query 2 which requires a contact person currently only returns records that meet the requirement, even though I use the (+) operator.
"SITE ID" "SITE NAME" "SITE ADDRESS" "CITY, STATE ZIP" "COUNTY_NAME" "FIRST NAME LAST NAME"
"40047" "GENEVIEVE, LLC" "707 LINCOLN ST" "GARNAVILLOR, IA 52052" "CLAYTON" "DALE KARTMAN"
I get 1 record rather than 3 records (2 with no contact person and 1 with a contact person). This is my dilemma. I have to run each of these queries separately, get the results, and copy them to a spreadsheet. Then I have to align the records with contact names to the first query of all facilities, which is very labor intensive. Hope this helps clarify my needs.
If I understood you correctly, it is the OUTER JOIN you're looking for.
Here's a simple example (based on Scott's EMP and DEPT tables) which shows what it is.
There are 4 departments in the DEPT table:
SQL> select deptno from dept order by deptno;
DEPTNO
----------
10
20
30
40
However, no employee works in department 40:
SQL> select deptno, ename from emp order by deptno;
DEPTNO ENAME
---------- ----------
10 KING
10 CLARK
10 MILLER
20 FORD
20 SMITH
20 JONES
30 JAMES
30 TURNER
30 MARTIN
30 WARD
30 ALLEN
30 BLAKE
12 rows selected.
SQL>
If you want to display information collected from both of those tables (department name from the DEPT table and employee name from the EMP table), you'd join those tables - just like you did (I'll use ANSI syntax which actually JOINS tables, instead of enumerating them and putting join conditions into the WHERE clause):
SQL> select d.deptno, d.dname, e.ename
2 from dept d join emp e on e.deptno = d.deptno
3 order by d.deptno;
DEPTNO DNAME ENAME
---------- -------------- ----------
10 ACCOUNTING KING
10 ACCOUNTING CLARK
10 ACCOUNTING MILLER
20 RESEARCH FORD
20 RESEARCH SMITH
20 RESEARCH JONES
30 SALES JAMES
30 SALES TURNER
30 SALES MARTIN
30 SALES WARD
30 SALES ALLEN
30 SALES BLAKE
12 rows selected.
SQL>
Looks OK, but - I'd like to get information about DEPTNO = 40, although nobody works in it. So, use outer join:
SQL> select d.deptno, d.dname, e.ename
2 from dept d left join emp e on e.deptno = d.deptno
3 order by d.deptno;
DEPTNO DNAME ENAME
---------- -------------- ----------
10 ACCOUNTING KING
10 ACCOUNTING CLARK
10 ACCOUNTING MILLER
20 RESEARCH FORD
20 RESEARCH SMITH
20 RESEARCH JONES
30 SALES JAMES
30 SALES TURNER
30 SALES MARTIN
30 SALES WARD
30 SALES ALLEN
30 SALES BLAKE
40 OPERATIONS
13 rows selected.
SQL>
Right! Here it is! (Note that LEFT JOIN produces the same result as LEFT OUTER JOIN; there is no need to specify "outer", although it makes things somewhat more obvious.)
Also, there's the "old" Oracle outer join operator, (+) (literally, a + sign enclosed in round brackets). The above query would work as well if we put it like this:
select d.deptno, d.dname, e.ename
from dept d, emp e
where d.deptno = e.deptno (+);
I'd suggest you do the same (outer join) with your query. Once again:
join tables in the JOIN clause
put filters into the WHERE clause
The query will be easier to read and maintain, and you'll know what is what. Also, if you use the "old" (+) operator, you can't outer join one table to more than one other table. As you go deeper and deeper, you might need to outer join some table to several others (which might even be the case for you), and that's where the ANSI join comes in.
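Applied to the sites question, the idea might look like the sketch below, run here against an in-memory SQLite database from Python rather than Oracle. The table layout is an assumption based on the column names in the question (in particular, that E_AFFILIATIONS.SITE_ID refers to E_SITES.ID), and the three sites are the sample rows shown there. One detail differs from a plain filter: the condition on AFFILIATION_TYPE belongs to the outer-joined table, so it is placed in the ON clause; putting it in WHERE would discard the sites without a contact again.

```python
import sqlite3


def sites_with_optional_contact(conn, counties):
    """Return every site in the given counties, with a contact name or None."""
    placeholders = ", ".join("?" for _ in counties)
    sql = f"""
        SELECT s.id, s.name,
               i.first_name || ' ' || i.last_name AS contact
        FROM e_sites AS s
        LEFT JOIN e_affiliations AS a
               ON a.site_id = s.id
              AND a.affiliation_type = 'SITE_CONTACT'
        LEFT JOIN e_individuals AS i
               ON i.rid = a.individual_rid
        WHERE s.county_name IN ({placeholders})
        ORDER BY s.id
    """
    return conn.execute(sql, tuple(counties)).fetchall()


# Assumed, simplified schema holding only the columns used above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE e_sites (id TEXT PRIMARY KEY, name TEXT, county_name TEXT);
    CREATE TABLE e_individuals (rid INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE e_affiliations (site_id TEXT, individual_rid INTEGER, affiliation_type TEXT);
    INSERT INTO e_sites VALUES
        ('09698', 'BODINE ELECTRIC', 'BREMER'),
        ('16895', 'BRUGGEMAN LUMBER', 'DELAWARE'),
        ('40047', 'GENEVIEVE, LLC', 'CLAYTON');
    INSERT INTO e_individuals VALUES (1, 'DALE', 'KARTMAN');
    INSERT INTO e_affiliations VALUES ('40047', 1, 'SITE_CONTACT');
""")

for row in sites_with_optional_contact(conn, ("BREMER", "DELAWARE", "CLAYTON")):
    print(row)
```

All three sites come back, and only the one with a SITE_CONTACT affiliation has a name in the last column. A site with several contacts would still produce several rows, so getting exactly one row per site would need an extra aggregation step.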
Good luck!
I have a large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
We have found that a particular type of disease is frequently comorbid with a common condition.
I would like to count the number of this type of disease that each patient has, and then create a histogram showing how many people have 1,2,3,4, etc. additional diseases.
This is the format of the data.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 282.1 1/2/10 F 87
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 282.1 12/15/13 M 41
2 232.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 601.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Generally, I was thinking of a DO loop, but I'm not sure where to start because there are duplicates in the dataset, like with patient 1 (282.1 is listed twice). I'm not sure how to account for that. Any thoughts?
Target diagnoses to count would be 282.1, 232.1, 250.02. In this example, patient 1 would have a count of 3, patient 2 would have 2, etc.
Edit:
This is what I have used, but each PatientID appears on multiple lines in the output.
PROC SQL;
create table want as
select age, gender, patientID,
count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID;
quit;
This is what the output table looks like. Why is this patientID showing up so many times?
Obs AGE GENDER PATIENTID count
1 55 Male 107828695 1
2 54 Male 107828695 1
3 54 Male 107828695 1
4 54 Male 107828695 1
5 54 Male 107828695 1
If you include variables that are neither grouping variables nor summary statistics, then SAS will happily re-merge your summary statistics back with all of the source records. That is why you are getting multiple records. AGE can usually vary if your dataset covers many years. And GENDER can also vary if your data is messy. So for a quick analysis you might try something like this.
create table want as
select patientID
, min(age) as age_at_onset
, min(gender) as gender
, count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID
;
I think you can get what you want with an SQL statement
PROC SQL NOPRINT;
create table want as
select PatientID,
count(distinct Diagnosis) as count
from have
where Diagnosis in (282.1, 232.1, 250.02)
group by PatientID;
quit;
This filters to only the diagnoses you are interested in, counts the distinct times they are seen, by the PatientID, and saves the results to a new table.
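For comparison, the same filter, distinct count, and histogram can be sketched in Python. The rows are the sample data from the question with diagnosis codes kept as strings (an assumption, to avoid comparing floating-point values); the names count_target_diagnoses and histogram are made up for the illustration.

```python
from collections import Counter, defaultdict

# (PatientID, Diagnosis) pairs from the sample data; patient 1 has a duplicate 282.1.
encounters = [
    (1, "282.1"), (1, "282.1"), (1, "232.1"), (1, "250.02"), (1, "125.1"), (1, "90.1"),
    (2, "140"), (2, "282.1"), (2, "232.1"),
    (3, "601.1"), (3, "231.1"), (3, "123.1"),
    (4, "601.1"), (4, "130.1"),
    (5, "230.1"), (5, "282.1"), (5, "250.02"),
]
TARGETS = {"282.1", "232.1", "250.02"}


def count_target_diagnoses(encounters, targets):
    """Count distinct target diagnoses per patient; duplicates collapse in the set."""
    seen = defaultdict(set)
    for patient_id, diagnosis in encounters:
        if diagnosis in targets:
            seen[patient_id].add(diagnosis)
    return {patient_id: len(found) for patient_id, found in seen.items()}


def histogram(counts):
    """Map each count value to the number of patients with that count."""
    return Counter(counts.values())


counts = count_target_diagnoses(encounters, TARGETS)
print(counts)
print(histogram(counts))
```

As with the WHERE clause in the SQL, patients with none of the target diagnoses (3 and 4 here) do not appear in the result at all.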