Pandas groupby mean absolute deviation - python-2.7

I have a pandas dataframe like this:
Product Group  Product ID  Units Sold  Revenue  Rev/Unit
A              451         8           $16      $2.00
A              987         15          $40      $2.67
A              311         2           $5       $2.50
B              642         6           $18      $3.00
B              251         4           $28      $7.00
I want to transform it to look like this:
Product Group  Units Sold  Revenue  Rev/Unit  Mean Abs Deviation
A              25          $61      $2.44     $0.24
B              10          $46      $4.60     $2.00
The Mean Abs Deviation column is computed from the Rev/Unit column of the first table. The tricky part is taking into account the respective weights behind each Rev/Unit value.
For example, taking a straight MAD of Product Group A's Rev/Unit would yield $0.26. After taking the weights into consideration, however, the MAD would be $0.24.
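For reference, the straight MAD works out as follows: the unweighted mean of 2.00, 2.67 and 2.50 is about 2.39, and the mean of the absolute deviations |2.00 - 2.39|, |2.67 - 2.39| and |2.50 - 2.39| is about 0.26.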
I know how to use groupby to get the simple summations for Units Sold and Revenue, but I'm a bit lost on the more complicated calculations for the next two columns.
Also, while we're giving advice/help: is there any easier way to create/paste tables into SO posts?
UPDATE:
Would a solution like this work? I know it will for the summation fields, but I'm not sure how to implement it for the latter two fields.
grouped_df = df.groupby("Product Group")
grouped_df.agg({
    'Units Sold': 'sum',
    'Revenue': 'sum',
    'Rev/Unit': 'Revenue'/'Units Sold',  # pseudocode: this isn't valid Python
    'MAD': some_function})

You need to clarify what the "weights" are. I assumed the weights are the number of units sold, but that gives a different result from yours:
import numpy as np

pv = df.pivot_table(index='Product Group',
                    values=['Units Sold', 'Revenue'],
                    aggfunc='sum')
pv['Rev/Unit'] = pv.Revenue / pv['Units Sold']
this gives:
               Revenue  Units Sold  Rev/Unit
Product Group
A                   61          25      2.44
B                   46          10      4.60
As for WMAD:
def wmad(prod):
    # unit-weighted mean absolute deviation of Rev/Unit within one group
    idx = df['Product Group'] == prod
    w = df['Units Sold'][idx]
    abs_dev = np.abs(df['Rev/Unit'][idx] - pv['Rev/Unit'][prod])
    return sum(abs_dev * w) / sum(w)

pv['Mean Abs Deviation'] = [wmad(idx) for idx in pv.index]
which, as I mentioned, gives a different result:
               Revenue  Units Sold  Rev/Unit  Mean Abs Deviation
Product Group
A                   61          25      2.44              0.2836
B                   46          10      4.60              1.9200

From your suggested solution, you can use a lambda function to operate on each group, e.g.:
'Rev/Unit': lambda x: calculate_revenue_per_unit(x)
Bear in mind that x is the Series of values for each group, so calculate_revenue_per_unit needs to operate on a Series rather than a single row.
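Since a per-column agg function only sees one column at a time, a groupby().apply() over each sub-frame may be a cleaner fit for the ratio and the weighted MAD. A minimal sketch, assuming the $ columns have already been converted to numeric:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product Group': ['A', 'A', 'A', 'B', 'B'],
                   'Product ID': [451, 987, 311, 642, 251],
                   'Units Sold': [8, 15, 2, 6, 4],
                   'Revenue': [16, 40, 5, 18, 28]})
df['Rev/Unit'] = df['Revenue'] / df['Units Sold']

def summarize(g):
    units = g['Units Sold'].sum()
    revenue = g['Revenue'].sum()
    rev_per_unit = revenue / units
    # unit-weighted mean absolute deviation around the group-level Rev/Unit
    mad = (np.abs(g['Rev/Unit'] - rev_per_unit) * g['Units Sold']).sum() / units
    return pd.Series({'Units Sold': units, 'Revenue': revenue,
                      'Rev/Unit': rev_per_unit, 'Mean Abs Deviation': mad})

print(df.groupby('Product Group').apply(summarize))
This yields about 0.2816 for A (the 0.2836 above comes from using the rounded 2.67 instead of 40/15) and 1.92 for B.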

Related

How can I calculate the GAP between two products from different tables and with a set of conditions?

I'm relatively new to the world of Power BI. I've got two different types of diesel, each of them with different prices.
I've also got calculated moving averages of both, and I need to see the average GAP between them, but under the condition that they both have a value on the same DAY; otherwise the average wouldn't be valid. The tables and expected result are roughly as follows:
TABLE DIESEL TYPE A
Date    Price DIESEL TYPE A
01-feb  1,2
05-may  1,3
06-ago  1,09
06-ago  1,1
07-sep  1,5
TABLE DIESEL TYPE B
Date    Price DIESEL TYPE B
01-feb  0,9
05-may  1,05
06-ago  0,8
06-ago  0,75
12-nov  0,7
Date    Average A  Average B
01-feb  1,2        0,9
05-may  1,3        1,05
06-ago  1,095      0,775
07-sep  1,5        -
12-nov  -          0,7
The expected GAP should be:
Date    GAP Average
01-feb  0,30
05-may  0,25
06-ago  0,32
07-sep  -
12-nov  -
On September 7th and November 12th I DON'T want these averages calculated or shown on my graph, i.e. in my measure.
I want the average of the difference between the two prices by date, under the condition that there are values for the same date for both diesel types; otherwise I don't want to calculate the average. If, for instance, there's a value on 07-sep for Type A but not for Type B, or vice versa, it should stay blank.
Use this measure:
GAP Average =
VAR avgA =
    AVERAGE ( 'DIESEL TYPE A'[Price DIESEL TYPE A] )
VAR avgB =
    AVERAGE ( 'DIESEL TYPE B'[Price DIESEL TYPE B] )
RETURN
    IF (
        OR ( ISBLANK ( avgA ), ISBLANK ( avgB ) ),
        BLANK (),
        avgA - avgB
    )
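This works because the visual evaluates the measure once per date: AVERAGE returns BLANK() whenever a table has no price in the current date context, so the ISBLANK guard suppresses the GAP for 07-sep and 12-nov, where only one diesel type has a value.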

Calculate the amount of the cost of tickets finalized per material divided by the total amount of the tickets finalized

I have the following need:
Calculate the ratio between the sum of the amounts of tickets with status Finalized for each material, and the sum of the total amounts of finalized tickets.
My fact table is like below:
TicketID  StatusID  MaterialID  CategoryID  Amount  FKDATE
123       3         45          9           150     12/03/2021
124       5         50          4           569     11/03/2021
125       3         78          78          556     14/03/2021
126       -1        -1          -1          -1      12/03/2021
My dimension Status is like below:
StatusID  Status
1         Open
2         In Process
3         Finalized
My dimension Material is like below:
MaterialID  MaterielLabel
1           Bikes
..          ..
I want to exclude the TicketID with MaterialID = -1.
Try the following:
AmountFinalizedByMaterial :=
VAR AmountFinalizedByMaterialGroup =
    CALCULATE (
        SUM ( yourFactTable[Amount] ),
        Status[Status] = "Finalized",
        yourFactTable[MaterialID] <> -1
    )
VAR TotalAmountFinalized =
    CALCULATE (
        SUM ( yourFactTable[Amount] ),
        Status[Status] = "Finalized",
        ALL ( Material )
    )
RETURN
    DIVIDE (
        AmountFinalizedByMaterialGroup,
        TotalAmountFinalized
    )
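ALL(Material) removes the material filter from the denominator, so each material's ratio is computed against the overall finalized total, and DIVIDE returns a blank instead of an error if that total is ever zero or blank.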

How to distinguish between BLANK and 0 in a PowerBI Measure?

I have 2 tables like this:
PM_History2
Serial#  Good
A        TRUE
B        FALSE
A        TRUE
B        FALSE
C        TRUE
A        FALSE
C        TRUE

CONTRACTS
Serial#  Enrollment#
A        1
B        2
C        3
D        4
I have a measure that calculates the number of Good for TRUE:
Count of Good for True =
CALCULATE(COUNTA('PM_History2'[Good]), 'PM_History2'[Good] IN { TRUE })
I then have a measure that calculates the percentage of TRUEs for Good.
PM Score = 'PM_History2'[Count of Good for True]/COUNTROWS(PM_History2)
When I create a table visualization to show all the Serial# values and their PM Score, I get this:
Serial#  PM Score
A        .67
B
C        1.00
D
What can I do to get what should be a zero to come through as 0, and what should be blank to stay blank? Like this:
Serial#  PM Score
A        .67
B        0
C        1.00
D
Thank you in advance!
Try this:
PM Score = DIVIDE ( [Count of Good for True] + 0, COUNTROWS ( PM_History2 ) )
Adding + 0 makes the numerator non-blank, but DIVIDE still returns a blank when the denominator is blank, thus distinguishing the results for B and D.

percentage bins based on predefined buckets

I have a series of numbers and I would like to know what % of the numbers fall in each bucket defined in a dataframe.
df['cuts'] has 10, 20 and 50 as values. Specifically, I would like to know what % of the series is in [0-10], (10-20] and (20-50], and this should be appended to the df dataframe.
I wrote the following code. I definitely feel it could be improved. Any help is appreciated.
bin_cuts = [-1] + list(df['cuts'].values)
out = pd.cut(series, bins=bin_cuts)
df_pct_bins = pd.value_counts(out, normalize=True).reset_index()
df_pct_bins = pd.concat([df_pct_bins['index'].str.split(', ', expand=True),
                         df_pct_bins['cuts']], axis=1)
df_pct_bins[1] = df_pct_bins[1].str[:-1].astype(str)
df['cuts'] = df['cuts'].astype(str)
df_pct_bins = pd.merge(df, df_pct_bins, left_on='cuts', right_on=1)
Consider the sample data df and s:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(cuts=[10, 20, 50]))
s = pd.Series(np.random.randint(50, size=1000))
Option 1
np.searchsorted
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.value_counts(
            c[np.searchsorted(c, s)],
            normalize=True
        )))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
Option 2
pd.cut
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.cut(
            s,
            np.append(-np.inf, c),
            labels=c
        ).value_counts(normalize=True)
    ))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
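Both options label each observation with the right edge of its bucket: np.searchsorted finds the first cut greater than or equal to each value, while pd.cut builds the intervals (-inf, 10], (10, 20] and (20, 50] explicitly, so mapping the normalized value counts back onto df.cuts produces the same pct column either way.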

Python remove outliers from data

I have a data frame as follows:
ID  Value
A   70
A   80
B   75
C   10
B   50
A   1000
C   60
B   2000
..  ..
I would like to group this data by ID, remove the outliers from the grouped data (the ones we see in the boxplot) and then calculate the mean.
So far:
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'mean': grouped['Value'].mean(),
                           'median': grouped['Value'].median(),
                           'std': grouped['Value'].std()})
How can I find the outliers, remove them and get the statistics?
I believe the method you're referring to is to remove values > 1.5 * the interquartile range away from the median. So first, calculate your initial statistics:
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})
And then determine whether values in the original DF are outliers:
def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    if row.Value > (median + 1.5 * iq_range) or row.Value < (median - 1.5 * iq_range):
        return True
    else:
        return False

# apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
# filter to only non-outliers:
df_no_outliers = df[~df.outlier]
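For the original ask (group by ID, drop outliers, then compute statistics), the fences can be computed per group and broadcast back with transform. A minimal sketch, assuming the df from the question and the conventional Q1/Q3 Tukey fences rather than the median-based rule above:
g = df.groupby('ID')['Value']
q1 = g.transform(lambda s: s.quantile(0.25))  # each row gets its own group's Q1
q3 = g.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
# keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for their group
mask = df['Value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
stats = df[mask].groupby('ID')['Value'].agg(['mean', 'median', 'std'])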
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
# the condition is a single boolean Series, so no .any(axis=1) is needed;
# note these fences use the whole Value column, not per-ID groups
data = df[~((df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR)))]
Just do:
In [187]: df[df < 100].groupby('ID').agg(['mean', 'median', 'std'])
Out[187]:
    Value
     mean median        std
ID
A    75.0   75.0   7.071068
B    62.5   62.5  17.677670
C    35.0   35.0  35.355339
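Note that df[df < 100] simply masks values of 100 or more to NaN, which the aggregations then ignore; the threshold is hard-coded rather than derived from the data, so this only works when you already know where the outliers begin.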