How do I JOIN a CTE with the rest of my query?

I'm trying to get the first occurrence date in mytable2 and join it with mytable1.
For example:
SELECT userid, occurrence_date
FROM (
SELECT
userid, occurrence_date,
row_number () over (partition by userid order by occurrence_date) rn
FROM mytable2
) cte where rn = 1;
How do I combine that with this query:
SELECT userid, fieldname, event_date, count(transactions)
FROM mytable1
WHERE YEAR(event_date) >= '2020'
GROUP BY userid, fieldname, event_date;
This is my first time working with CTEs, so I don't really understand how to put them together.
FROM mytable1
LEFT JOIN mytable2 ON mytable1.userid = mytable2.userid
Here is some sample data:
mytable1
-------------------------------------------------
|userid|fieldname|event_date|count(transactions)|
-------------------------------------------------
| 1 |limes |05/10/2020| 4 |
-------------------------------------------------
| 1 |potatoes |05/10/2020| 3 |
-------------------------------------------------
| 2 |pears |02/15/2020| 8 |
-------------------------------------------------
| 2 |pineapple|03/02/2020| 6 |
-------------------------------------------------
| 2 |oranges |03/05/2020| 10 |
-------------------------------------------------
mytable2
------------------------------------
|userid| occurrence_date |
------------------------------------
| 1 |04/20/2019 01:12:00.000 |
| 1 |04/20/2019 01:12:15.010 |
| 1 |05/10/2020 05:15:33.020 |
| 1 |05/10/2020 05:16:23.011 |
| 2 |03/25/2018 07:33:16.013 |
| 2 |02/15/2020 09:15:30.223 |
| 2 |03/02/2020 11:24:16.210 |
| 2 |03/05/2020 10:30:16.123 |
------------------------------------
mytable3 (result table)... taking occurrence_date from mytable2 and calling it acquisition_date in mytable3 because it is the earliest transaction date.
-------------------------------------------------------------------------
|userid|fieldname|event_date|count(transactions)|acquisition_date |
-------------------------------------------------------------------------
| 1 |limes |05/10/2020| 4 |04/20/2019 01:12:00.000|
-------------------------------------------------------------------------
| 1 |potatoes |05/10/2020| 3 |04/20/2019 01:12:00.000|
-------------------------------------------------------------------------
| 2 |pears |02/15/2020| 8 |03/25/2018 07:33:16.013|
-------------------------------------------------------------------------
| 2 |pineapple|03/02/2020| 6 |03/25/2018 07:33:16.013|
-------------------------------------------------------------------------
| 2 |oranges |03/05/2020| 10 |03/25/2018 07:33:16.013|
-------------------------------------------------------------------------

I figured it out:
with cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY userid ORDER BY occurrence_date) AS rn
FROM mytable2
)
SELECT cte.occurrence_date, mytable1.userid, mytable1.fieldname, mytable1.event_date, COUNT(transactions)
FROM mytable1
LEFT JOIN cte ON mytable1.userid = cte.userid and cte.rn = 1
WHERE YEAR(mytable1.event_date) >= 2020
GROUP BY cte.occurrence_date, mytable1.userid, mytable1.fieldname, mytable1.event_date;
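To try the pattern above end to end, here is a minimal sketch using Python's built-in sqlite3 module. It makes several assumptions: the raw rows in mytable1 are made up (the question only shows aggregated counts), dates are stored as ISO text, SQLite's strftime stands in for YEAR(), and the aliases t1, n and acquisition_date are only illustrative. It needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Hypothetical raw rows: the question only shows aggregated counts, so the
# per-transaction rows in mytable1 are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable1 (userid INTEGER, fieldname TEXT, event_date TEXT, transactions INTEGER);
CREATE TABLE mytable2 (userid INTEGER, occurrence_date TEXT);
INSERT INTO mytable1 VALUES
  (1, 'limes',    '2020-05-10', 10),
  (1, 'limes',    '2020-05-10', 11),
  (1, 'potatoes', '2020-05-10', 12),
  (2, 'pears',    '2020-02-15', 13),
  (2, 'pears',    '2019-12-31', 14);
INSERT INTO mytable2 VALUES
  (1, '2019-04-20 01:12:00.000'),
  (1, '2019-04-20 01:12:15.010'),
  (1, '2020-05-10 05:15:33.020'),
  (2, '2018-03-25 07:33:16.013'),
  (2, '2020-02-15 09:15:30.223');
""")

query = """
WITH cte AS (
    SELECT userid, occurrence_date,
           ROW_NUMBER() OVER (PARTITION BY userid ORDER BY occurrence_date) AS rn
    FROM mytable2
)
SELECT t1.userid, t1.fieldname, t1.event_date,
       COUNT(t1.transactions) AS n,
       cte.occurrence_date AS acquisition_date
FROM mytable1 AS t1
LEFT JOIN cte ON t1.userid = cte.userid AND cte.rn = 1  -- keep only the earliest row per user
WHERE CAST(strftime('%Y', t1.event_date) AS INTEGER) >= 2020
GROUP BY t1.userid, t1.fieldname, t1.event_date, cte.occurrence_date
ORDER BY t1.userid, t1.fieldname
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Putting rn = 1 in the ON clause rather than in WHERE keeps the LEFT JOIN semantics: users with no rows in mytable2 still appear, with a NULL acquisition date.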

Related

Calculate totals within custom SQL query (Tableau)

I have a dataset where I wish to reflect the totals from a custom SQL query I performed in Tableau. Here is some sample data:
1. I first performed a custom query that was a join, unpivot and placed my data into groups
Size Tb Val type Group Sum_AVG SKU Last_Refreshed
270 90.5 Free_Space_TB Group2 90.5 Excel 9/1/2020
270 179.5 Used Group2 179.5 Excel 9/1/2020
814 701 Free_Space_TB Group1 701 Gris 8/1/2020
814 112 Used Group1 112 Gris 8/1/2020
2. Then I aggregated the data by taking the sum of one group and the average of the other group (and finally summed these groups' values)
The data is being aggregated like this: (SUM_AVG)
zn(sum(if [Group]= 'Group1' then [Val] end))
+
zn(avg(if [Group] = 'Group2' then [Val] end))
The view looks like this
Here is the custom query output
Here is my view
The avail and used appear when I hover over, but how would I include the total?
This is the calculation I am using (thanks to help from a SO member):
{SUM({Fixed [type]: ZN(sum(if [Group]= 'Group1' then [Val] end))})
+
sum({Fixed [type]: zn(avg(if [Group] = 'Group2' then [Val] end))})}
I am doing something wrong, because it is totaling up across all the column(s), (I have more columns in the full dataset) when I just want the total for each column.
(Used was created from using a custom query)
Any assistance is appreciated.
In my opinion, you can do this without changing the underlying view. WINDOW_SUM is a table calculation and always depends on the view/context generated. Therefore, I prefer LOD calculations, which do not depend on context.
I think you should proceed like this. As always, I have changed the sample data to include sufficient detail.
Data used
| Id | Avail | group | used | Date |
|----|-------|--------|------|------------|
| A | 5 | Group1 | 5 | 20-01-2020 |
| A | 20 | Group1 | 20 | 20-01-2020 |
| B | 10 | Group2 | 10 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| A | 10 | Group1 | 10 | 20-01-2020 |
| A | 10 | Group1 | 10 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| A | 5 | Group1 | 5 | 20-02-2019 |
| A | 20 | Group1 | 20 | 20-02-2019 |
| B | 10 | Group2 | 10 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
| A | 10 | Group1 | 10 | 20-02-2019 |
| A | 10 | Group1 | 10 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
Step-1 Pivot generated in tableau as earlier.
Step-2 Calculated field sum-avg also generated as discussed.
Step-3 View generated
Step-4 Add another field total
{FIXED [Date], [Group]: sum(
{FIXED [Date], [Group], [type]: zn(sum(if [Group]= 'Group1' then [val] end))}
+
{Fixed [Date], [Group], [type]: zn(avg(if [Group] = 'Group2' then [val] end))}
)}
Step-5 Add this field to Detail on the Marks card.
The code used in the tooltip is shown below. Obviously, you can tweak it to taste.
Under the <Group> , <AGG(Sum_Avg)> was <type> out of total <SUM(Total)> SKU on <YEAR(Date)>
This solution works:
1. Create a calculated field:
WINDOW_SUM([SUM_AVG])
2. Drag the newly computed field to the view
3. Right click ‘Edit Table Calculation’
4. Specify and compute using [Last_Refreshed] and [type]
This will allow you to compute across cells, giving you your desired result.

In SAS, how do you stop flagging a group of rows if a specific condition is met?

I have a table in SAS dataset that looks like this:
proc sql;
create table my_table
(id char(1),
my_date num format=date9.,
my_col num);
insert into my_table
values('A','01JAN2010'd,.)
values('A','02JAN2010'd,0)
values('A','03DEC2009'd,1)
values('A','04NOV2009'd,1)
values('B','01JAN2010'd,.)
values('B','02NOV2009'd,2)
values('C','01JAN2010'd,.)
values('C','02OCT2009'd,3)
values('D','01JAN2010'd,.)
values('D','02NOV2009'd,2)
values('D','03OCT2009'd,1)
values('D','04AUG2009'd,2)
values('D','05MAY2009'd,3)
values('D','06APR2009'd,1);
quit;
I am trying to create a new column desired that, for each group of id column, flags the row with a value of 1 if the value in my_col is missing or less than 3.
The part I'm having trouble with is that when there is a my_col value that is greater than 2, I need the desired value for that row to be missing and also stop flagging any remaining rows in the id group with a value of 1.
The resulting dataset should look like this:
+----+-----------+--------+---------+
| id | my_date | my_col | desired |
+----+-----------+--------+---------+
| A | 01JAN2010 | . | 1 |
| A | 02JAN2010 | 0 | 1 |
| A | 03DEC2009 | 1 | 1 |
| A | 04NOV2009 | 1 | 1 |
| B | 01JAN2010 | . | 1 |
| B | 02NOV2009 | 2 | 1 |
| C | 01JAN2010 | . | 1 |
| C | 02OCT2009 | 3 | . |
| D | 01JAN2010 | . | 1 |
| D | 02NOV2009 | 2 | 1 |
| D | 03OCT2009 | 1 | 1 |
| D | 04AUG2009 | 2 | 1 |
| D | 05MAY2009 | 3 | . |
| D | 06APR2009 | 1 | . |
+----+-----------+--------+---------+
Looks like a simple application of a retained variable. Set the flag to 1 when you start a new group and then set it to missing when the value of MY_COL is larger than 2.
data want;
set my_table ;
by id;
if first.id then desired=1;
if my_col>2 then desired=.;
retain desired;
run;
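For readers who do not have SAS at hand, the retained-flag logic can be sketched in Python. This is only an illustration of the same idea, not a translation of the DATA step machinery: the function name flag_rows is hypothetical, None stands in for a SAS missing value, and the rows are assumed to be already grouped by id as BY-group processing requires.

```python
from itertools import groupby
from operator import itemgetter

# Sample data from the question; None plays the role of a SAS missing value.
rows = [
    ("A", "01JAN2010", None), ("A", "02JAN2010", 0),
    ("A", "03DEC2009", 1), ("A", "04NOV2009", 1),
    ("B", "01JAN2010", None), ("B", "02NOV2009", 2),
    ("C", "01JAN2010", None), ("C", "02OCT2009", 3),
    ("D", "01JAN2010", None), ("D", "02NOV2009", 2),
    ("D", "03OCT2009", 1), ("D", "04AUG2009", 2),
    ("D", "05MAY2009", 3), ("D", "06APR2009", 1),
]

def flag_rows(rows):
    """Flag rows with 1 until my_col exceeds 2 within each id group."""
    out = []
    for _, group in groupby(rows, key=itemgetter(0)):
        desired = 1  # equivalent of "if first.id then desired=1"
        for id_, my_date, my_col in group:
            # In SAS a missing value is never > 2, so skip None explicitly.
            if my_col is not None and my_col > 2:
                desired = None  # stays missing for the rest of the group
            out.append((id_, my_date, my_col, desired))
    return out

flags = [row[3] for row in flag_rows(rows)]
print(flags)
```

Because desired is only reset at the start of each group, once it goes to None it stays that way, which is exactly what RETAIN does in the DATA step.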
Also it is not clear why you used such complicated code to create your example data. Why not a simple data step?
data my_table;
input id :$1. my_date :date. my_col;
format my_date date9.;
cards;
A 01JAN2010 .
A 02JAN2010 0
A 03DEC2009 1
A 04NOV2009 1
B 01JAN2010 .
B 02NOV2009 2
C 01JAN2010 .
C 02OCT2009 3
D 01JAN2010 .
D 02NOV2009 2
D 03OCT2009 1
D 04AUG2009 2
D 05MAY2009 3
D 06APR2009 1
;
I can't think of a simpler way to do it, but this works. You will need to have your data sorted by id.
data my_table2;
set my_table;
by id;
format gt2flag $1.;
retain gt2flag;
if first.id then gt2flag='';
if my_col gt 2 then gt2flag='Y';
if gt2flag = 'Y' then desired=.;
else desired=1;
drop gt2flag;
run;
id my_date my_col desired
A 01JAN2010 . 1
A 02JAN2010 0 1
A 03DEC2009 1 1
A 04NOV2009 1 1
B 01JAN2010 . 1
B 02NOV2009 2 1
C 01JAN2010 . 1
C 02OCT2009 3 .
D 01JAN2010 . 1
D 02NOV2009 2 1
D 03OCT2009 1 1
D 04AUG2009 2 1
D 05MAY2009 3 .
D 06APR2009 1 .

Calculating if Start time occurs within 1 hour range for each person (Single column)

I'm trying to figure out how to calculate whether the start times for each subject occur within 1 hour of each other. However, I only have one column and two groups with two different dates. I have no second variable to compare against for a DHMS time difference, as all the times occur under the same column variable. I have thought of doing a LAG on the first time and then an INTCK to calculate the 24 hour time difference between each, but I don't think I have sufficient arguments for the INTCK function. Alternatively, I could maybe do a PROC TRANSPOSE and then do a time difference between each array variable, but that seems messy. Does anyone have a less clunky and more efficient solution? I might be overthinking this.
Sample Data:
+----------+-------+------+------------+------------+
| CLIENTID | GRPID | date | start_date | start_time |
+----------+-------+------+------------+------------+
| 2 | 1 | -2 | 10Nov2019 | 23:19:52 |
| 3 | 1 | -2 | 10Nov2019 | 23:22:51 |
| 4 | 1 | -2 | 10Nov2019 | 23:20:16 |
| 5 | 1 | -2 | 10Nov2019 | 23:21:30 |
| 6 | 1 | -2 | 10Nov2019 | 23:23:51 |
| 23 | 2 | -2 | 11Nov2019 | 23:11:38 |
| 24 | 2 | -2 | 11Nov2019 | 23:38:33 |
| 25 | 2 | -2 | 11Nov2019 | 23:15:01 |
| 26 | 2 | -2 | 11Nov2019 | 23:08:43 |
+----------+-------+------+------------+------------+
You can compile the start date and time into a temporary datetime variable (_start_dt) to ease the comparison. Then, taking the first datetime for each GRPID as the baseline, you could use a RETAIN statement to pass that baseline datetime (_base_dt) down the related data rows and find the time difference (time_diff) using the INTCK function with a dtsecond interval.
proc sort data=your_data;
by grpid clientid;
run;
data your_results (drop=_:);
retain CLIENTID GRPID DATE start_date start_time _base_dt;
format _base_dt _start_dt datetime16. time_diff time8.;
set your_data;
by grpid clientid;
_start_dt = dhms(start_date,hour(start_time),minute(start_time),second(start_time));
if first.grpid then _base_dt = _start_dt;
time_diff = intck('dtsecond', _base_dt, _start_dt);
run;
This gives the following results dataset:
+----------+-------+------+------------+------------+-----------+
| CLIENTID | GRPID | date | start_date | start_time | time_diff |
+----------+-------+------+------------+------------+-----------+
| 2 | 1 | -2 | 10Nov2019 | 23:19:52 | 00:00:00 |
| 3 | 1 | -2 | 10Nov2019 | 23:22:51 | 00:02:59 |
| 4 | 1 | -2 | 10Nov2019 | 23:20:16 | 00:00:24 |
| 5 | 1 | -2 | 10Nov2019 | 23:21:30 | 00:01:38 |
| 6 | 1 | -2 | 10Nov2019 | 23:23:51 | 00:03:59 |
| 23 | 2 | -2 | 11Nov2019 | 23:11:38 | 00:00:00 |
| 24 | 2 | -2 | 11Nov2019 | 23:38:33 | 00:26:55 |
| 25 | 2 | -2 | 11Nov2019 | 23:15:01 | 00:03:23 |
| 26 | 2 | -2 | 11Nov2019 | 23:08:43 | -0:02:55 |
+----------+-------+------+------------+------------+-----------+
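The same baseline-difference idea can be cross-checked outside SAS. The sketch below is an illustration in Python under a few assumptions: the tuple layout and the names base and diffs are hypothetical, the differences are kept as plain seconds rather than formatted times, and month abbreviations are parsed with the default English locale.

```python
from datetime import datetime

# (CLIENTID, GRPID, start_date, start_time) from the sample data.
rows = [
    (2, 1, "10Nov2019", "23:19:52"), (3, 1, "10Nov2019", "23:22:51"),
    (4, 1, "10Nov2019", "23:20:16"), (5, 1, "10Nov2019", "23:21:30"),
    (6, 1, "10Nov2019", "23:23:51"), (23, 2, "11Nov2019", "23:11:38"),
    (24, 2, "11Nov2019", "23:38:33"), (25, 2, "11Nov2019", "23:15:01"),
    (26, 2, "11Nov2019", "23:08:43"),
]

base = {}   # first datetime seen per group, like the retained _base_dt
diffs = []  # (CLIENTID, GRPID, seconds from the group's baseline)
for clientid, grpid, d, t in sorted(rows, key=lambda r: (r[1], r[0])):
    start = datetime.strptime(f"{d} {t}", "%d%b%Y %H:%M:%S")
    base.setdefault(grpid, start)  # only set on the first row of a group
    diffs.append((clientid, grpid, int((start - base[grpid]).total_seconds())))

for row in diffs:
    print(row, "within 1 hour" if abs(row[2]) <= 3600 else "outside 1 hour")
```

Note the negative value for client 26: the baseline is simply the first row of the group in CLIENTID order, not the earliest start time.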
I think I’ve interpreted your requirements correctly. Let me know if not.
It sounds like you want to check if the RANGE of the start_time over each group is < 1 hour:
Coerce the start_date to a datetime value and add the start_time before computing the range.
data have;
input
CLIENTID GRPID date start_date: date9. start_time: hhmmss6.;
format start_date date9. start_time time8.;
datalines;
2 1 -2 10Nov2019 23:19:52
3 1 -2 10Nov2019 23:22:51
4 1 -2 10Nov2019 23:20:16
5 1 -2 10Nov2019 23:21:30
6 1 -2 10Nov2019 23:23:51
23 2 -2 11Nov2019 23:11:38
24 2 -2 11Nov2019 23:38:33
25 2 -2 11Nov2019 23:15:01
26 2 -2 11Nov2019 23:08:43
run;
proc sql;
create table want (label="start range status by group") as
select
grpid,
range(dhms(start_date,0,0,0)+start_time) as start_range format time8.,
calculated start_range < '01:00:00't as one_hr_start_flag
from have
group by grpid;
quit;
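As a rough cross-check of the per-group range test, here is a small Python sketch. It assumes the flag should compare the range against one hour, and the names starts and result are purely illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (GRPID, start_date, start_time) from the sample data.
rows = [
    (1, "10Nov2019", "23:19:52"), (1, "10Nov2019", "23:22:51"),
    (1, "10Nov2019", "23:20:16"), (1, "10Nov2019", "23:21:30"),
    (1, "10Nov2019", "23:23:51"), (2, "11Nov2019", "23:11:38"),
    (2, "11Nov2019", "23:38:33"), (2, "11Nov2019", "23:15:01"),
    (2, "11Nov2019", "23:08:43"),
]

# Combine date and time into one datetime per row, grouped by GRPID.
starts = defaultdict(list)
for grpid, d, t in rows:
    starts[grpid].append(datetime.strptime(f"{d} {t}", "%d%b%Y %H:%M:%S"))

# Range = latest start minus earliest start; flag groups under one hour.
result = {}
for grpid, values in starts.items():
    start_range = max(values) - min(values)
    result[grpid] = (start_range, start_range < timedelta(hours=1))

print(result)
```

Working on full datetimes rather than times of day means a group that straddles midnight is still measured correctly.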
If you want to disregard the groups and focus only on the time of day, disregarding the date, the range computation would be:
* Presuming 'noon' is the center of the day;
proc sql;
create table want (label="time of day start range status overall") as
select
range(start_time) as range format time8.,
calculated range < '01:00:00't as one_hr_start_flag
from have;
quit;
Looking only at the time of day is always troublesome when some time values fall slightly after midnight.

How to sum up a measure based on different levels in Power BI using DAX

I have the following table structure:
| Name 1 | Name 2 | Month | Count 1 | Count 2 | SumCount |
|--------|--------|--------|---------|---------|----------|
| A | E | 1 | 5 | 3 | 8 |
| A | E | 2 | 1 | 6 | 7 |
| A | F | 3 | 3 | 4 | 7 |
Now I calculate the following with a DAX measure.
Measure = (sum(Table[Count 2]) - sum(Table[Count 1])) * sum(Table[SumCount])
I can't use a column because then the formula is applied before excluding a layer (eg. month). Added to my table structure and excluded month it would look like that:
| Name 1 | Name 2 | Count 1 | Count 2 | SumCount | Measure |
|--------|--------|---------|---------|----------|---------|
| A | E | 6 | 9 | 15 | 45 |
| A | F | 3 | 4 | 7 | 7 |
I added a table to the view which only displays Name 1, in which case the measure of course sums up Count 1, Count 2 and SumCount and then applies the formula, which leads to the following result:
| Name 1 | Measure |
|--------|---------|
| A | 88 |
But the desired result should be
| Name 1 | Measure |
|--------|---------|
| A | 52 |
which is the sum of Measure.
So basically I want the calculation at my base level, Measure = (sum(Table[Count 2]) - sum(Table[Count 1])) * sum(Table[SumCount]), but when drilling up and grouping those names it should only perform a sum.
An iterator function like SUMX is what you want here since you are trying to sum row by row rather than aggregating first.
Measure = SUMX ( Table, ( Table[Count 2] - Table[Count 1] ) * Table[SumCount] )
Any filters you have will be applied to the first argument, Table, and it will only sum the corresponding rows.
Edit:
If I'm understanding correctly, you want to aggregate over Month before taking the difference and product. One way to do this is by summarizing (excluding Month) before using SUMX like this:
Measure =
VAR Summary =
SUMMARIZE (
Table,
Table[Name 1],
Table[Name 2],
"Count1Sum", SUM ( Table[Count 1] ),
"Count2Sum", SUM ( Table[Count 2] ),
"SumCountSum", SUM ( Table[SumCount] )
)
RETURN
SUMX ( Summary, ( [Count2Sum] - [Count1Sum] ) * [SumCountSum] )
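To see why the level of aggregation matters, the three variants can be compared on the sample rows. The following is a plain Python sketch of the arithmetic only, not of how the DAX engine evaluates filter context; the variable names are illustrative.

```python
from collections import defaultdict

# (Name 1, Name 2, Month, Count 1, Count 2, SumCount) from the question.
rows = [
    ("A", "E", 1, 5, 3, 8),
    ("A", "E", 2, 1, 6, 7),
    ("A", "F", 3, 3, 4, 7),
]

# Aggregate everything first, then apply the formula (the original measure).
c1 = sum(r[3] for r in rows)
c2 = sum(r[4] for r in rows)
sc = sum(r[5] for r in rows)
aggregate_first = (c2 - c1) * sc

# Apply the formula to every row, then sum (SUMX over the raw table).
row_by_row = sum((r[4] - r[3]) * r[5] for r in rows)

# Summarize by Name 1 / Name 2 (dropping Month), then apply and sum.
summary = defaultdict(lambda: [0, 0, 0])
for name1, name2, _month, count1, count2, sumcount in rows:
    totals = summary[(name1, name2)]
    totals[0] += count1
    totals[1] += count2
    totals[2] += sumcount
summarized = sum((t[1] - t[0]) * t[2] for t in summary.values())

print(aggregate_first, row_by_row, summarized)
```

Only the summarize-then-iterate variant reproduces the desired total of 52 (45 for A/E plus 7 for A/F).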
You don't want a measure in this case; rather, you need a new column.
The same formula in a new column will give your desired result.
Column = ('Table (2)'[Count1]-'Table (2)'[Count2])*'Table (2)'[SumCount]

Power BI - weighted average yield across 2 tables of a given date

I would like to calculate the weighted average yield across two related tables as of a given date
Table1 Table2
+-------------------------------+ +-------------------------------+
| ID TradeDate Amount | | ID TradeDate Yield |
+-------------------------------+ +-------------------------------+
| 1 2018/11/30 100 | | 1 2018/11/8 2.2% |
| 1 2018/11/8 101 | | 1 2018/8/8 2.1% |
| 1 2018/10/31 102 | | 1 2018/5/8 2.0% |
| 1 2018/9/30 103 | | 2 2018/9/8 1.7% |
| 2 2018/11/30 200 | | 2 2018/6/8 1.6% |
| 2 2018/10/31 203 | | 2 2018/3/8 1.5% |
| 2 2018/9/30 205 | | 3 2018/10/20 1.7% |
| 3 2018/11/30 300 | | 3 2018/7/20 1.6% |
| 3 2018/10/31 300 | | 3 2018/4/20 1.6% |
| 3 2018/9/30 300 | +-------------------------------+
+-------------------------------+
I created a table named 'DateList' and use a slicer to select a specific date.
I want to achieve the following result:
as of *11/9/2018*
+-----------------------------------------------------------------+
| ID LastDate Value LatestYieldDate LastYield |
+-----------------------------------------------------------------+
| 1 2018/11/8 101 2018/11/8 2.2% |
| 2 2018/10/31 203 2018/9/8 1.7% |
| 3 2018/10/31 300 2018/10/20 1.7% |
+-----------------------------------------------------------------+
| Total 604 1.7836% |
+-----------------------------------------------------------------+
Currently, I use the following formulas to achieve a partial result
Create 2 measures in table1
LastDate =
VAR SlicerDate = MIN(DateList[Date])
VAR MinDiff =
MINX(FILTER(ALL(Table1),Table1[ID] IN VALUES(Table1[ID])),
ABS(SlicerDate - Table1[TradeDate]))
RETURN
MINX(FILTER(ALL(Table1),Table1[ID] IN VALUES(Table1[ID])
&& ABS(SlicerDate - Table1[TradeDate]) = MinDiff),
Table1[TradeDate])
Value = CALCULATE(SUM(Table1[Amount]), FILTER(Table1, Table1[TradeDate] = [LastDate]))
Create 2 measures in table2
LastYieldDate =
VAR SlicerDate = MIN(DateList[Date])
VAR MinDiff =
MINX(FILTER(ALL(Table2),Table2[ID] IN VALUES(Table2[ID])),
ABS(SlicerDate - Table2[TradeDate]))
RETURN
MINX(FILTER(ALL(Table2),Table2[ID] IN VALUES(Table2[ID])
&& ABS(SlicerDate - Table2[TradeDate]) = MinDiff),
Table2[TradeDate])
LastYield = CALCULATE(SUM(Table2[Yield]), FILTER(Table2,
Table2[TradeDate] = [LastYieldDate]))
I have no idea how to calculate the correct weighted average yield across the 2 tables
Here is my current result.
You'll first need to create a bridge table for the ID values so you can work with both tables more easily.
IDList = VALUES(Table1[ID])
Now we'll use IDList[ID] on our visual instead of the ID from one of the other tables.
The measure we use for the average last yield is a basic sum-product average:
LastYieldAvg =
DIVIDE(
SUMX(IDList, [Value] * [LastYield]),
SUMX(IDList, [Value])
)
Note that when there is only a single ID value, it simplifies to
[Value] * [LastYield] / [Value] = [LastYield]
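The expected total of 1.7836% can be checked by hand with a short Python sketch. It mirrors the logic of the measures rather than the DAX itself: the helper name nearest is hypothetical, yields are kept as percentages, and, like the LastDate measure in the question, it picks the row whose date is closest to the slicer date (earlier date on ties).

```python
from datetime import date

as_of = date(2018, 11, 9)  # slicer date from the question

# {ID: [(TradeDate, Amount)]} and {ID: [(TradeDate, Yield in %)]}
table1 = {
    1: [(date(2018, 11, 30), 100), (date(2018, 11, 8), 101),
        (date(2018, 10, 31), 102), (date(2018, 9, 30), 103)],
    2: [(date(2018, 11, 30), 200), (date(2018, 10, 31), 203),
        (date(2018, 9, 30), 205)],
    3: [(date(2018, 11, 30), 300), (date(2018, 10, 31), 300),
        (date(2018, 9, 30), 300)],
}
table2 = {
    1: [(date(2018, 11, 8), 2.2), (date(2018, 8, 8), 2.1), (date(2018, 5, 8), 2.0)],
    2: [(date(2018, 9, 8), 1.7), (date(2018, 6, 8), 1.6), (date(2018, 3, 8), 1.5)],
    3: [(date(2018, 10, 20), 1.7), (date(2018, 7, 20), 1.6), (date(2018, 4, 20), 1.6)],
}

def nearest(entries, target):
    """Return the value whose date is closest to target (earlier date on ties)."""
    return min(entries, key=lambda e: (abs((e[0] - target).days), e[0]))[1]

value = {i: nearest(rows, as_of) for i, rows in table1.items()}
last_yield = {i: nearest(rows, as_of) for i, rows in table2.items()}

# Sum-product average: sum(Value * LastYield) / sum(Value)
total_value = sum(value.values())
weighted = sum(value[i] * last_yield[i] for i in value) / total_value

print(total_value, round(weighted, 4))
```

With the sample data this picks amounts 101, 203 and 300 and yields 2.2%, 1.7% and 1.7%, matching the total row in the desired result.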