Minimum per subgroup in stata - stata

In Stata, I want to calculate the minimum and maximum for subgroups per country and year, with the result stored in every observation.
Ultimately, I want the difference between the max and the min as a separate variable.
Here is an example for my dataset:
country  year  oranges  type
USA      2021  100      1
USA      2021  200      0
USA      2021  900      0
USA      2022  500      1
USA      2022  300      0
Canada   2022  300      0
Canada   2022  400      1
The results should look like this:
country  year  oranges  type  min(type=1)  max(type=0)  distance
USA      2021  100      1     100          900          800
USA      2021  200      0     100          900          800
USA      2021  900      0     100          900          800
USA      2022  500      1     500          300          -200
USA      2022  300      0     500          300          -200
Canada   2022  300      0     400          300          -100
Canada   2022  400      1     400          300          -100
So far, I tried the following code:
bysort year country: egen smalloranges = min(oranges) if type == 1
bysort year country: egen bigoranges = max(oranges) if type == 0
gen distance = bigoranges - smalloranges

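Your code restricts each egen result to the observations that satisfy the if condition, so smalloranges is missing wherever type == 0 and bigoranges is missing wherever type == 1, and the difference is missing in every observation. I would approach this directly with cond(), which keeps the calculation over the whole country-year group, as follows: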
* Example generated by -dataex-. For more info, type help dataex
clear
input str6 country int(year oranges) byte type
"USA" 2021 100 1
"USA" 2021 200 0
"USA" 2021 900 0
"USA" 2022 500 1
"USA" 2022 300 0
"Canada" 2022 300 0
"Canada" 2022 400 1
end
egen min = min(cond(type == 1, oranges, .)), by(country year)
egen max = max(cond(type == 0, oranges, .)), by(country year)
gen wanted = max - min
list, sepby(country year)
     +------------------------------------------------------+
     | country   year   oranges   type   min   max   wanted |
     |------------------------------------------------------|
  1. |     USA   2021       100      1   100   900      800 |
  2. |     USA   2021       200      0   100   900      800 |
  3. |     USA   2021       900      0   100   900      800 |
     |------------------------------------------------------|
  4. |     USA   2022       500      1   500   300     -200 |
  5. |     USA   2022       300      0   500   300     -200 |
     |------------------------------------------------------|
  6. |  Canada   2022       300      0   400   300     -100 |
  7. |  Canada   2022       400      1   400   300     -100 |
     +------------------------------------------------------+
For more discussion, see Section 9 of https://www.stata-journal.com/article.html?article=dm0055
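For readers who want to check the logic outside Stata, here is a minimal Python sketch of the same idea: compute the minimum over type == 1 and the maximum over type == 0 within each country-year group, then attach both to every row. It is not part of the Stata solution, and names such as by_group and result are made up for illustration.

```python
from collections import defaultdict

# (country, year, oranges, type), copied from the example data
rows = [
    ("USA", 2021, 100, 1),
    ("USA", 2021, 200, 0),
    ("USA", 2021, 900, 0),
    ("USA", 2022, 500, 1),
    ("USA", 2022, 300, 0),
    ("Canada", 2022, 300, 0),
    ("Canada", 2022, 400, 1),
]

# Collect, per (country, year), the oranges values of each type.
by_group = defaultdict(lambda: {0: [], 1: []})
for country, year, oranges, type_ in rows:
    by_group[(country, year)][type_].append(oranges)

# Attach min(type == 1), max(type == 0) and their difference to every row.
# A group that lacks one of the types gets None, like a Stata missing value.
result = []
for country, year, oranges, type_ in rows:
    values = by_group[(country, year)]
    lo = min(values[1]) if values[1] else None
    hi = max(values[0]) if values[0] else None
    wanted = hi - lo if lo is not None and hi is not None else None
    result.append((country, year, oranges, type_, lo, hi, wanted))

for row in result:
    print(row)
```

The key point mirrors cond() in the egen call: the type filter selects which values enter the min or max, but it does not restrict which rows receive the result.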

I am not sure if I understand the purpose of type 1 and 0, but this generates the exact result you describe in the tables. It might seem convoluted to create temporary files like this, but I think it modularizes the code into clean blocks.
* Example generated by -dataex-. For more info, type help dataex
clear
input str6 country int(year oranges) byte type
"USA" 2021 100 1
"USA" 2021 200 0
"USA" 2021 900 0
"USA" 2022 500 1
"USA" 2022 300 0
"Canada" 2022 300 0
"Canada" 2022 400 1
end
tempfile min1 max0
* Get min values for type 1 in each country-year
preserve
keep if type == 1
collapse (min) min_type_1=oranges , by(country year)
save `min1'
restore
* Get max values for type 0 in each country-year
preserve
keep if type == 0
collapse (max) max_type_0=oranges , by(country year)
save `max0'
restore
* Merge the min and the max
merge m:1 country year using `min1', nogen
merge m:1 country year using `max0', nogen
* Calculate distance
gen distance = max_type_0 - min_type_1

Is there a recommended Power BI DAX pattern for calculating monthly Days Sales Outstanding (a.k.a. Debtor Days) using the Countback method?

Is there a recommended Power BI DAX pattern for calculating monthly Days Sales Outstanding (a.k.a. DSO or Debtor Days) using the Countback method?
I have been searching for a while and although there are many asking about it, there is no working solution recommendation I can find. I think that is perhaps because nobody has set out the problem properly so I am going to try to explain as fully as possible.
DSO is a widely-used management accounting measure of the average number of days that it takes a business to collect payment for its credit sales. More background info on the metric here: https://www.investopedia.com/terms/d/dso.asp
There are various options for defining the calculation. I believe my requirement is known as the countback method. My data set is a fairly large star schema with a separate date dimension, but using the below simplified data set to generate a solution would totally point me in the right direction.
Input data set as follows:
Month No  Month  Days in Month  Debt Balance  Gross Income
1         Jan    31             1000          700
2         Feb    28             1100          500
3         Mar    31             900           400
4         Apr    30             950           600
5         May    31             1000          400
6         Jun    30             1100          550
7         Jul    31             900           700
8         Aug    31             950           500
9         Sep    30             1000          400
10        Oct    31             1100          600
11        Nov    30             900           400
12        Dec    31             950           550
The aim is to create a measure for debtor days equal to the number of days of average daily income per month we need to count back to match the debt balance.
Starting with Dec as an example in 3 steps:
Debt Balance = 950, income = 550. Dec has 31 days. So we take all 31 days of income and reduce the debt balance to 400 (i.e. 950 - 550) and go back to the previous month.
Remaining Dec debt balance = 400. Nov income = 700. We don't need all of the daily income from Nov to match the rest of the Dec debt balance. 400/700 x 30 days in Nov = 17.14 days
We have finished counting back days. 31 + 17.14 = 48.14 debtor days
Nov has a higher balance so we need 1 more step:
Debt balance= 1500, income = 700. Nov has 30 days. So we take all 30 days of income and reduce the debt balance to 800 (i.e. 1500 - 700) and go back to the previous month.
Remaining Nov Debt balance = 800. Oct Income = 600. Oct has 31 days. So we take all 31 days of income from Oct and reduce the Nov debt balance to 200 (i.e. 1500 - 700 - 600)
Remaining Nov debt balance = 200. Sep Income = 400. We don't need all of the daily income from Sep to match the rest of the Nov debt balance. 200/400 x 30 days in Sep = 15 days
We have finished counting back days. 30 + 31 + 15 = 76 debtor days
Apr has a lower balance so can be resolved in one step:
Debt Balance = 400, income = 600. Apr has 30 days. We don't need all of Apr Income as income exceeds debt in this month. 400/600 * 30 = 20 debtor days
The required solution for Debtor days in the simplified data set is therefore shown in the right-most "Debtor Days" column as follows:
Month No  Month  Days  Debt Balance  Gross Income  Debtor Days
1         Jan    31    1000          700
2         Feb    28    1100          500           54.57
3         Mar    31    900           400           59.00
4         Apr    30    400           600           20.00
5         May    31    600           400           41.00
6         Jun    30    800           550           49.38
7         Jul    31    900           700           41.91
8         Aug    31    950           500           50.93
9         Sep    30    1000          400           65.43
10        Oct    31    1100          600           67.20
11        Nov    30    1500          700           76.00
12        Dec    31    950           550           48.14
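To make the countback rule concrete, here is a minimal Python sketch of the steps described above. It is only an illustration of the arithmetic, not a DAX solution. It uses the Debt Balance and Gross Income figures from the results table (which differ from the input table for some months), and the names data and countback_days are made up for this example. When the available history runs out before the debt is matched, as for Jan, it returns None.

```python
# (month, days in month, debt balance, gross income), from the results table
data = [
    ("Jan", 31, 1000, 700),
    ("Feb", 28, 1100, 500),
    ("Mar", 31, 900, 400),
    ("Apr", 30, 400, 600),
    ("May", 31, 600, 400),
    ("Jun", 30, 800, 550),
    ("Jul", 31, 900, 700),
    ("Aug", 31, 950, 500),
    ("Sep", 30, 1000, 400),
    ("Oct", 31, 1100, 600),
    ("Nov", 30, 1500, 700),
    ("Dec", 31, 950, 550),
]


def countback_days(index):
    """Debtor days for data[index], counting back month by month."""
    remaining = data[index][2]  # debt balance still to be matched
    days = 0.0
    for _, days_in_month, _, income in reversed(data[: index + 1]):
        if remaining <= 0:
            return days
        if income >= remaining:
            # Only part of this month's income is needed: prorate its days.
            return days + remaining / income * days_in_month
        # The whole month is needed: take all its days and keep going back.
        days += days_in_month
        remaining -= income
    # Ran out of earlier months before the debt was fully matched.
    return None if remaining > 0 else days


debtor_days = {row[0]: countback_days(i) for i, row in enumerate(data)}
for month, value in debtor_days.items():
    print(month, "n/a" if value is None else f"{value:.2f}")
```

The loop has no fixed limit on how many months it walks back, which is the part that is awkward to express in DAX.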
I hope the above explains the required calculation sufficiently. Of course it needs to be implemented as a measure rather than a calculated column as in the real world it needs to work with more complex scenarios with the user defining the filter context at runtime by filtering and slicing in Power BI.
If anyone can recommend a DAX calculation for Debtor Days, that would be great!
This works on a small example, but it may not work on a large model.
There is no easy way to do this: DAX isn't a general-purpose programming language, so we cannot use loops or recursive statements, and there are many other limitations.
We can only mimic this behavior with a bulk, brute-force calculation (which is a resource-consuming task). The most interesting part is the variable _zz, where for each row we calculate 3 versions of the main table, limited to 1, 2, and 3 rows (as you can see, some values are hardcoded; I assume the result can be found in at most 3 iterations). You can investigate this, if you want, by adding a NewTable from this code:
filter(GENERATE(SELECTCOLUMNS(GENERATE(Sheet1, GENERATESERIES(1,3,1)),"MYK", [MonthYearKey], "MonthToCheck", [Value], "Debt", [Debt Balance]),
var _tmp = TOPN([MonthToCheck],FILTER(ALL(Sheet1), Sheet1[MonthYearKey] <= [MYK] ), Sheet1[MonthYearKey], DESC)
return row("IncomAgg", SUMX(_tmp, Sheet1[Gross Income]) )
), [IncomAgg] >= [Debt])
Next, I extract two pieces of information from this table variable: how many months back we must go (__monthinscoop) and the backward running income (__backwardrunningIncom).
Full code (I use MonthYearKey for time navigating purpose):
Mes =
var __currRowDebt = SELECTEDVALUE(Sheet1[Debt Balance])
var _zz = TOPN(1,
filter(GENERATE(SELECTCOLUMNS(GENERATE(Sheet1, GENERATESERIES(1,3,1)),"MYK", [MonthYearKey], "MonthToCheck", [Value], "Debt", [Debt Balance]),
var _tmp = TOPN([MonthToCheck],FILTER(ALL(Sheet1), Sheet1[MonthYearKey] <= [MYK] ), Sheet1[MonthYearKey], DESC)
return row("IncomAgg", SUMX(_tmp, Sheet1[Gross Income]) )
), [IncomAgg] >= [Debt]), [MonthToCheck], ASC)
var __monthinscoop = sumx(_zz,[MonthToCheck]) - 2
var __backwardrunningIncom = sumx(_zz,[IncomAgg])
var _calc = CALCULATE( sum(Sheet1[Days]), filter(ALL(Sheet1), Sheet1[MonthYearKey] <= SELECTEDVALUE( Sheet1[MonthYearKey]) && Sheet1[MonthYearKey] >= SELECTEDVALUE( Sheet1[MonthYearKey]) - __monthinscoop ))
var __twik = SWITCH( TRUE()
, __monthinscoop < 0 , -1
, __monthinscoop = 0 , 1
, __monthinscoop = 1 , 3
,0)
var __GetRowValue = CALCULATE( SUM(Sheet1[Gross Income]), FILTER(ALL(Sheet1), Sheet1[MonthYearKey] = (SELECTEDVALUE( Sheet1[MonthYearKey]) + __monthinscoop - __twik)))
var __GetRowDays = CALCULATE( SUM(Sheet1[Days]), FILTER(ALL(Sheet1), Sheet1[MonthYearKey] = (SELECTEDVALUE( Sheet1[MonthYearKey]) + __monthinscoop - __twik)))
return
_calc+DIVIDE(__GetRowValue - (__backwardrunningIncom - __currRowDebt), __GetRowValue) * __GetRowDays

DAX: Calculate only when data is present for all years in dataset

I have a data set that contains sales forecast data by year over 5 years.
Each row has customer, item type, year, qty and sales price.
Not all customers buy all products in all years.
I want to get a list of all products that are purchased in all of the listed years.
An example, cut-down table looks like this:
Customer Product Year Qty Price
CustA ProdA 2020 50 100
CustA ProdA 2021 50 100
CustA ProdA 2022 50 100
CustA ProdB 2020 50 100
CustA ProdB 2021 50 100
CustA ProdC 2021 50 100
CustA ProdC 2022 50 100
CustA ProdD 2020 50 100
CustA ProdD 2021 50 100
CustA ProdD 2022 50 100
CustB ProdA 2021 50 100
CustB ProdA 2022 50 100
CustB ProdC 2020 50 100
CustB ProdC 2021 50 100
CustB ProdC 2022 50 100
CustB ProdD 2020 50 100
CustB ProdD 2021 0 100
CustB ProdD 2022 50 100
And transposed, looks like this:
Customer  Product  2020  2021  2022
CustA     ProdA    50    50    50
CustA     ProdB    50    50
CustA     ProdC          50    50
CustA     ProdD    50    50    50
CustB     ProdA          50    50
CustB     ProdC    50    50    50
CustB     ProdD    50    0     50
So, for this example, I'd want to do calculations on, or indicate rows that have a sales qty for all three years. I was trying to use the following formula which I would have compared with the max number of years in the set to mark a row as valid or not, but it's killing Excel. There are only 32,000 rows in the source table.
=CALCULATE(
DISTINCTCOUNT(DataTable[Year]),
filter(DataTable, DataTable[Product] = EARLIER(DataTable[Product])),
filter(DataTable, DataTable[Customer] = EARLIER(DataTable[Customer])),
filter(DataTable, DataTable[Qty] > 0)
)
Is there a better approach I could use for this?
How about this?
ProductList =
VAR AllYears = DISTINCTCOUNT ( 'DataTable'[Year] )
VAR Summary =
SUMMARIZE (
'DataTable',
'DataTable'[Product],
"YearsPurchased", CALCULATE (
DISTINCTCOUNT ( 'DataTable'[Year] ),
'DataTable'[Qty] > 0
)
)
RETURN
SELECTCOLUMNS (
FILTER ( Summary, [YearsPurchased] = AllYears ),
"Product", [Product]
)
The Summary aggregates at the Product level and looks at how many distinct years it had with non-zero quantity. Then you just filter for the ones that match AllYears and take the Product column.
Note that this returns a single column table and thus doesn't work as a calculated column or measure but a list is what you asked for.
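As a cross-check of the counting logic (not a replacement for the DAX), here is a minimal Python sketch on the example rows. It collects the years with non-zero quantity per key and keeps the keys that cover every year in the data. The function name purchased_in_all_years and its key parameter are made up for this example. The key can be the product alone, as in the Summary above, or the customer and product pair, as in the transposed table.

```python
from collections import defaultdict

# (customer, product, year, qty) from the example table; price is not needed
rows = [
    ("CustA", "ProdA", 2020, 50), ("CustA", "ProdA", 2021, 50), ("CustA", "ProdA", 2022, 50),
    ("CustA", "ProdB", 2020, 50), ("CustA", "ProdB", 2021, 50),
    ("CustA", "ProdC", 2021, 50), ("CustA", "ProdC", 2022, 50),
    ("CustA", "ProdD", 2020, 50), ("CustA", "ProdD", 2021, 50), ("CustA", "ProdD", 2022, 50),
    ("CustB", "ProdA", 2021, 50), ("CustB", "ProdA", 2022, 50),
    ("CustB", "ProdC", 2020, 50), ("CustB", "ProdC", 2021, 50), ("CustB", "ProdC", 2022, 50),
    ("CustB", "ProdD", 2020, 50), ("CustB", "ProdD", 2021, 0), ("CustB", "ProdD", 2022, 50),
]


def purchased_in_all_years(rows, key):
    """Return the keys that have qty > 0 in every year present in rows."""
    all_years = {year for _, _, year, _ in rows}
    years_with_sales = defaultdict(set)
    for customer, product, year, qty in rows:
        if qty > 0:
            years_with_sales[key(customer, product)].add(year)
    return sorted(k for k, years in years_with_sales.items() if years == all_years)


# Product level, like the Summary table in the DAX above.
by_product = purchased_in_all_years(rows, key=lambda customer, product: product)
# Customer and product level, like the rows of the transposed table.
by_pair = purchased_in_all_years(rows, key=lambda customer, product: (customer, product))

print(by_product)
print(by_pair)
```

The two groupings give different answers on this data: ProdC qualifies at the product level only because CustB bought it in 2020, so it is worth deciding which level the list should be built at.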
Edit: To get the YearsPurchased as a calculated column, you just need part of this:
YearsPurchased =
CALCULATE (
DISTINCTCOUNT ( 'DataTable'[Year] ),
FILTER ( ALLEXCEPT ( 'DataTable', 'DataTable'[Product] ), 'DataTable'[Qty] > 0 )
)
You don't need to use DAX to achieve this. Create a matrix visualization using the needed data.
Remember to disable the total and subtotal options.
Another solution is to use a new column, so you don't have to expand the matrix.
Column = COMBINEVALUES( " ", Table[Customer], Table[Product] )
Hope it helps you.

Populate df row value based on column header

Appreciate any help. Basically, I have a poor data set and am trying to make it more useful.
Below is a representation
df = pd.DataFrame({'State': ("Texas","California","Florida"),
'Q1 Computer Sales': (100,200,300),
'Q1 Phone Sales': (400,500,600),
'Q1 Backpack Sales': (700,800,900),
'Q2 Computer Sales': (200,200,300),
'Q2 Phone Sales': (500,500,600),
'Q2 Backpack Sales': (800,800,900)})
I would like to have a df that creates separate columns for the Quarters and Sales for the respective state.
I think this might need regex, str.contains, and loops?
IIUC, you can use:
df_a = df.set_index('State')
df_a.columns = pd.MultiIndex.from_arrays(zip(*df_a.columns.str.split(' ', n=1)))
df_a.stack(0).reset_index()
Output:
State level_1 Backpack Sales Computer Sales Phone Sales
0 Texas Q1 700 100 400
1 Texas Q2 800 200 500
2 California Q1 800 200 500
3 California Q2 800 200 500
4 Florida Q1 900 300 600
5 Florida Q2 900 300 600
Or we can go further:
df_a = df.set_index('State')
df_a.columns = pd.MultiIndex.from_arrays(zip(*df_a.columns.str.split(' ', n=1)), names=['Quarters','Items'])
df_a = df_a.stack(0).reset_index()
df_a['Quarters'] = df_a['Quarters'].str.extract(r'(\d+)', expand=False)
print(df_a)
Output:
Items State Quarters Backpack Sales Computer Sales Phone Sales
0 Texas 1 700 100 400
1 Texas 2 800 200 500
2 California 1 800 200 500
3 California 2 800 200 500
4 Florida 1 900 300 600
5 Florida 2 900 300 600
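An alternative that avoids the MultiIndex, sketched here as an option and not as the approach used above, is to melt the frame to long form, split the old column names into a quarter and an item, and pivot back. The intermediate column names column, value, Quarter and Item are my own choices. Note that rows come out sorted by State and Quarter, not in the original order.

```python
import pandas as pd

df = pd.DataFrame({'State': ("Texas", "California", "Florida"),
                   'Q1 Computer Sales': (100, 200, 300),
                   'Q1 Phone Sales': (400, 500, 600),
                   'Q1 Backpack Sales': (700, 800, 900),
                   'Q2 Computer Sales': (200, 200, 300),
                   'Q2 Phone Sales': (500, 500, 600),
                   'Q2 Backpack Sales': (800, 800, 900)})

# One row per (State, original column name).
long = df.melt(id_vars="State", var_name="column", value_name="value")

# "Q1 Computer Sales" -> Quarter "Q1", Item "Computer Sales".
long[["Quarter", "Item"]] = long["column"].str.split(" ", n=1, expand=True)

# One row per (State, Quarter), one column per item.
tidy = (long.pivot(index=["State", "Quarter"], columns="Item", values="value")
            .reset_index())
tidy.columns.name = None

print(tidy)
```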

Django ORM QUERY Adjacent row sum with sqlite

In my database I'm storing data as below:
id amt
-- -------
1 100
2 -50
3 100
4 -100
5 200
I want to get output like below
id amt balance
-- ----- -------
1 100 100
2 -50 50
3 100 150
4 -100 50
5 200 250
How can I do this with the Django ORM?
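The balance column is a running total of amt ordered by id. As a minimal sketch of the underlying query, here it is with Python's built-in sqlite3 module and a SQL window function, assuming the bundled SQLite is version 3.25 or newer. The table name entry is hypothetical. In the Django ORM, I believe the counterpart is an annotate() call using a Window expression with Sum('amt') and order_by=F('id').asc(), but treat that as a pointer to check against the Django documentation, not as tested code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the example rows.
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, amt INTEGER)")
conn.executemany(
    "INSERT INTO entry (id, amt) VALUES (?, ?)",
    [(1, 100), (2, -50), (3, 100), (4, -100), (5, 200)],
)

# Running total: sum of amt over all rows up to and including the current id.
rows = conn.execute(
    "SELECT id, amt, SUM(amt) OVER (ORDER BY id) AS balance "
    "FROM entry ORDER BY id"
).fetchall()

for row in rows:
    print(row)
```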

PowerBI - Average and Variance Calculation with conditions

I am trying to calculate Variance and Average in PowerBI. I am running into Circular dependency errors.
This is my Data,
Month Year Item Count
1 2017 Chair 100
1 2017 Chair 200
1 2017 Chair 300
1 2017 Bench 110
1 2017 Bench 140
1 2017 Bench 150
2 2017 Chair 180
2 2017 Chair 190
2 2017 Chair 250
2 2017 Bench 270
2 2017 Bench 370
3 2017 Chair 120
3 2017 Chair 150
3 2017 Bench 180
3 2017 Bench 190
4 2017 Chair 200
4 2017 Chair 210
4 2017 Bench 220
4 2017 Bench 230
.
.
.
Average = Sum of Counts for the Previous 3 months / 3
Variance = (Average - Sum(CurrentMonth)) / Sum(CurrentMonth)
So, because the average won't be meaningful for the first 3 months, I wouldn't be worried about that.
Expected Output,
Month Year Item Sum(CurrentMonth) Average Variance
1
1
2
2
3
3
4 2017 Chair 410 497 0.21
4 2017 Bench x y z
Let's say, for Chair:
Sum of Current Month = 200 + 210 = 410
Average of Last 3 Months = (100 + 200 + 300 + 180 + 190 + 250 + 120 + 150 )/ 3 = 1490 / 3 = 497
Variance = (497 - 410) / 410 = 87 / 410 = 0.21
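The Chair calculation can be checked with a short Python sketch. This is only the arithmetic, not DAX; the names data, monthly_sum and stats are made up for this example, and the variance divides by the current month's sum, as in the worked Chair example.

```python
# (month, item, count) for 2017, from the sample data
data = [
    (1, "Chair", 100), (1, "Chair", 200), (1, "Chair", 300),
    (1, "Bench", 110), (1, "Bench", 140), (1, "Bench", 150),
    (2, "Chair", 180), (2, "Chair", 190), (2, "Chair", 250),
    (2, "Bench", 270), (2, "Bench", 370),
    (3, "Chair", 120), (3, "Chair", 150),
    (3, "Bench", 180), (3, "Bench", 190),
    (4, "Chair", 200), (4, "Chair", 210),
    (4, "Bench", 220), (4, "Bench", 230),
]


def monthly_sum(item, month):
    """Total count for one item in one month."""
    return sum(count for m, i, count in data if m == month and i == item)


def stats(item, month):
    """Return (current month sum, previous 3 months' sum / 3, variance)."""
    current = monthly_sum(item, month)
    average = sum(monthly_sum(item, m) for m in range(month - 3, month)) / 3
    variance = (average - current) / current
    return current, average, variance


current, average, variance = stats("Chair", 4)
print(current, round(average), round(variance, 2))
```

Because the months here all fall within one year, the sketch steps back by plain month numbers; data spanning several years would need a proper year and month key.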
Kindly share your thoughts.
I started with your data as Table1 (with a couple of months of data added).
I loaded it into Power BI and added a column called "YearMonth" using this code:
YearMonth = Table1[Year]&FORMAT(Table1[Month],"00")
Then I added another column called "Sum(CurrentMonth)" using this code:
Sum(CurrentMonth) = SUMX(FILTER(FILTER(Table1,Table1[Item]=EARLIER(Table1[Item])),VALUE(Table1[YearMonth])=VALUE(EARLIER(Table1[YearMonth]))),Table1[Count])
Then I added another column called "Average" using this code:
Average = SUMX(FILTER(FILTER(FILTER(Table1,Table1[Item]=EARLIER(Table1[Item])),VALUE(Table1[YearMonth])<=VALUE(EARLIER(Table1[YearMonth]))-1),VALUE(Table1[YearMonth])>=VALUE(EARLIER(Table1[YearMonth]))-3),Table1[Count])/3
Lastly, I added a column called "Variance" using this code:
Variance = (Table1[Average]-Table1[Sum(CurrentMonth)])/Table1[Sum(CurrentMonth)]
I hope this helps you.