Python remove outliers from data


I have a data frame as follows:
ID Value
A 70
A 80
B 75
C 10
B 50
A 1000
C 60
B 2000
.. ..
I would like to group this data by ID, remove the outliers from the grouped data (the ones we see from the boxplot) and then calculate mean.
So far I have:
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})
How can I find the outliers, remove them and then get the statistics?

I believe the method you're referring to is to remove values that lie more than 1.5 times the interquartile range away from the median. So first, calculate your initial statistics:
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})
And then determine whether values in the original DF are outliers:
def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    if row.Value > (median + (1.5 * iq_range)) or row.Value < (median - (1.5 * iq_range)):
        return True
    else:
        return False
#apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
#filter to only non-outliers:
df_no_outliers = df[~(df.outlier)]
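
For comparison, here is a minimal sketch of the usual boxplot rule, which measures the 1.5 * IQR distance from the quartiles rather than from the median, computed per ID with groupby/transform. The sample values are made up for illustration (with only three values per group, as in the excerpt from the question, these fences would not flag anything), and the names keep and stats_after are my own.

```python
import pandas as pd

# Made-up sample: five ordinary values plus one extreme value per ID.
df = pd.DataFrame({
    "ID": ["A"] * 6 + ["B"] * 6,
    "Value": [70, 72, 75, 78, 80, 1000, 50, 55, 60, 65, 75, 2000],
})

grouped = df.groupby("ID")["Value"]

# Broadcast each group's quartiles back onto the original rows.
q1 = grouped.transform(lambda v: v.quantile(0.25))
q3 = grouped.transform(lambda v: v.quantile(0.75))
iqr = q3 - q1

# Keep rows that fall inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] for their own ID.
keep = df["Value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[keep]

# Statistics after removing the outliers.
stats_after = df_no_outliers.groupby("ID")["Value"].agg(["mean", "median", "std"])
print(stats_after)
```

Because the fences are computed with transform, no row-wise apply is needed, which tends to be noticeably faster on large frames.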

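# Quartiles over the whole 'Value' column (not per ID)
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
# Keep only the rows inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
data = df[~((df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR)))]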

Just do:
In [187]: df[df['Value'] < 100].groupby('ID').agg(['mean','median','std'])
Out[187]:
   Value
    mean median        std
ID
A   75.0   75.0   7.071068
B   62.5   62.5  17.677670
C   35.0   35.0  35.355339

Related

How can I write a query to carry a remaining balance of hours forward for load leveling a schedule?

I have a query result with a total amount of hours scheduled per week in chronological order without gaps and have a set amount of hours that can be processed each week. Any hours not processed should be carried over to one or more following weeks. The following information is available.
Week | Hours | Capacity
1      2000    160
2      100     160
3      0       140
4      150     160
5      500     160
6      1500    160
Each week it should reduce the new hours plus carried over hours by the Capacity but never go below zero. A positive value should carry into the following week(s).
Week | Hours | Capacity | LeftOver = (Hours + LAG(LeftOver) - Capacity)
1      400     160        240  (400 +   0 - 160)
2      100     160        180  (100 + 240 - 160)
3      0       140        40   (  0 + 180 - 140)
4      20      160        0    ( 20 +  40 - 160) (no negative, change to zero)
5      500     160        340  (500 +   0 - 160)
6      0       160        180  (  0 + 340 - 160)
I'm assuming this can be done with cte recursion and a running value that doesn't go below zero but I can't find any specific examples of how this would be written.

Well, you are not wrong, a recursive common table expression is indeed an option to construct a solution.
Construction of recursive queries can generally be done in steps. Run your query after every step and validate the result.
Define the "anchor" of your recursion: where does the recursion start? Here the start is defined by Week = 1.
Define a recursion iteration: what is the relation between iterations? Here that would be the incrementing week numbers d.Week = r.Week + 1.
Avoiding negative numbers can be resolved with a case expression.
Sample data
create table data
(
  Week int,
  Hours int,
  Capacity int
);
insert into data (Week, Hours, Capacity) values
(1, 400, 160),
(2, 100, 160),
(3,   0, 140),
(4,  20, 160),
(5, 500, 160),
(6,   0, 160);
Solution
with rcte as
(
  select d.Week,
         d.Hours,
         d.Capacity,
         case
           when d.Hours - d.Capacity > 0
           then d.Hours - d.Capacity
           else 0
         end as LeftOver
  from data d
  where d.Week = 1
    union all
  select d.Week,
         d.Hours,
         d.Capacity,
         case
           when d.Hours + r.LeftOver - d.Capacity > 0
           then d.Hours + r.LeftOver - d.Capacity
           else 0
         end
  from rcte r
  join data d
    on d.Week = r.Week + 1
)
select r.Week,
       r.Hours,
       r.Capacity,
       r.LeftOver
from rcte r
order by r.Week;
Result
Week  Hours  Capacity  LeftOver
----  -----  --------  --------
1     400    160       240
2     100    160       180
3     0      140       40
4     20     160       0
5     500    160       340
6     0      160       180
Fiddle to see things in action.
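
To sanity-check the recurrence outside the database, here is a small Python sketch of the same rule: LeftOver = max(0, Hours + previous LeftOver - Capacity), starting from a previous LeftOver of zero. The function name carry_forward is my own and not part of the SQL solution; the numbers are the sample data above.

```python
def carry_forward(hours, capacity):
    """Return the LeftOver per week; the balance never drops below zero."""
    leftover = 0  # assumption: nothing is carried into the first week
    result = []
    for h, c in zip(hours, capacity):
        # New hours plus the carried balance, reduced by this week's capacity.
        leftover = max(0, h + leftover - c)
        result.append(leftover)
    return result


hours = [400, 100, 0, 20, 500, 0]
capacity = [160, 160, 140, 160, 160, 160]

leftovers = carry_forward(hours, capacity)
print(leftovers)  # [240, 180, 40, 0, 340, 180]
```

Each week depends on the clamped value of the week before, which is why a plain windowed running sum is not enough and the SQL needs recursion.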

I ended up writing a few CTEs then a recursive CTE and got what I needed. The capacity is a static number here but will be replaced later with one that takes holidays and vacations into account. Will also need to consider the initial 'LeftOver' value for the first week but could use this query with an earlier date period to find the most recent date with a zero LeftOver value then use that as a new start date, then filter out those earlier weeks in the final query.
DECLARE @StartDate date = (SELECT MAX(FirstDayOfWorkWeek) FROM dbo._Calendar WHERE Date <= GETDATE());
DECLARE @EndDate date = DATEADD(week, 12, @StartDate);
DECLARE @EmployeeQty int = (SELECT ISNULL(COUNT(*), 0) FROM Employee WHERE DefaultDepartment IN (4) AND Hidden = 0 AND DateTerminated IS NULL);
WITH hours AS (
/* GRAB ALL NEW HOURS SCHEDULED FOR EACH WEEK IN THE SELECTED PERIOD */
SELECT c.FirstDayOfWorkWeek as [Date]
, SUM(budget.Hours) as hours
FROM dbo.Project_Phase phase
JOIN dbo.Project_Budget_Labor budget on phase.ID = budget.Phase
JOIN dbo._Calendar c on CONVERT(date, phase.Date1) = c.[Date]
WHERE phase.CompletedOn IS NULL AND phase.Project <> 4266
AND phase.Date1 BETWEEN @StartDate AND @EndDate
AND budget.Department IN (4)
GROUP BY c.FirstDayOfWorkWeek
)
, weeks AS (
/* CREATE BLANK ROWS FOR EACH WEEK AND JOIN TO ACTUAL HOURS TO ELIMINATE GAPS */
/* ADD A ROW NUMBER FOR RECURSION IN NEXT CTE */
SELECT cal.[Date]
, ROW_NUMBER() OVER(ORDER BY cal.[Date]) as [rownum]
, ISNULL(SUM(hours.Hours), 0) as Hours
FROM (SELECT FirstDayOfWorkWeek as [Date] FROM dbo._Calendar WHERE [Date] BETWEEN @StartDate AND @EndDate GROUP BY FirstDayOfWorkWeek) as cal
LEFT JOIN hours on cal.[Date] = hours.[Date]
GROUP BY cal.[Date]
)
, spread AS (
/* GRAB FIRST WEEK AND USE RECURSION TO CREATE RUNNING TOTAL THAT DOES NOT DROP BELOW ZERO*/
SELECT TOP 1 [Date]
, rownum
, Hours
, @EmployeeQty * 40 as Capacity
, CONVERT(numeric(9,2), 0.00) as LeftOver
, Hours as running
FROM weeks
ORDER BY rownum
UNION ALL
SELECT curr.[Date]
, curr.rownum
, curr.Hours
, @EmployeeQty * 40 as Capacity
, CONVERT(numeric(9,2), CASE WHEN curr.Hours + prev.LeftOver - (@EmployeeQty * 40) < 0 THEN 0 ELSE curr.Hours + prev.LeftOver - (@EmployeeQty * 40) END) as LeftOver
, curr.Hours + prev.LeftOver as running
FROM weeks curr
JOIN spread prev on curr.rownum = (prev.rownum + 1)
)
SELECT spread.Hours as NewHours
, spread.LeftOver as PrevHours
, spread.Capacity
, spread.running as RunningTotal
, CASE WHEN running < Capacity THEN running ELSE Capacity END as HoursThisWeek
FROM spread

How to distinguish between BLANK and 0 in a PowerBI Measure?

I have 2 tables like this:
PM_History2
Serial# Good
A TRUE
B FALSE
A TRUE
B FALSE
C TRUE
A FALSE
C TRUE
CONTRACTS
Serial# Enrollment#
A 1
B 2
C 3
D 4
I have a measure that calculates the number of Good for TRUE:
Count of Good for True =
CALCULATE(COUNTA('PM_History2'[Good]), 'PM_History2'[Good] IN { TRUE })
I then have a measure that calculates the percentage of TRUEs for Good.
PM Score = 'PM_History2'[Count of Good for True]/COUNTROWS(PM_History2)
When I create a table visualization to show all the Serial# and their PM Score I get this:
Serial# PM Score
A .67
B
C 1.00
D
What can I do so that a result that should be zero shows as 0 and a result that should be blank stays blank? Like this:
Serial# PM Score
A .67
B 0
C 1.00
D
Thank you in advance!

Try this:
PM Score = DIVIDE ( [Count of Good for True] + 0, COUNTROWS ( PM_History2 ) )
Adding + 0 makes the numerator nonblank but the DIVIDE function still returns a blank when the denominator is blank, thus distinguishing the results for B and D.
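
To make the reasoning concrete, here is a rough Python model of the blank handling, using the sample tables from the question. It is not DAX: BLANK is represented by None, and the helpers divide and pm_score are my own simplified stand-ins for the behaviour described above (a blank count when no rows match, + 0 turning a blank numerator into 0, and a blank result when the denominator is blank or zero).

```python
BLANK = None  # stand-in for a DAX blank

# Sample rows of PM_History2 as (Serial#, Good) and the serials in CONTRACTS.
history = [("A", True), ("B", False), ("A", True), ("B", False),
           ("C", True), ("A", False), ("C", True)]
contracts = ["A", "B", "C", "D"]


def divide(numerator, denominator):
    """Simplified model: blank when the denominator is blank or zero."""
    if denominator is BLANK or denominator == 0:
        return BLANK
    return numerator / denominator


def pm_score(serial):
    rows = [good for s, good in history if s == serial]
    good_count = rows.count(True) or BLANK  # blank when no TRUE rows exist
    row_count = len(rows) or BLANK          # blank when the serial has no rows
    numerator = 0 if good_count is BLANK else good_count  # the "+ 0" step
    return divide(numerator, row_count)


scores = {serial: pm_score(serial) for serial in contracts}
print(scores)
```

In this model B has history rows but none are TRUE, so it scores 0, while D has no history rows at all, so the denominator is blank and the result stays blank.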

percentage bins based on predefined buckets

I have a series of numbers and I would like to know what % of the numbers fall in each bucket defined by a dataframe.
df['cuts'] has 10, 20 and 50 as values. Specifically, I would like to know what % of the series is in the [0-10], (10-20] and (20-50] bins, and this should be appended to the df dataframe.
I wrote the following code. I definitely feel that it could be improved. Any help is appreciated.
bin_cuts = [-1] + list(df['cuts'].values)
out = pd.cut(series, bins = bin_cuts)
df_pct_bins = pd.value_counts(out, normalize= True).reset_index()
df_pct_bins = pd.concat([df_pct_bins['index'].str.split(', ', expand = True), df_pct_bins['cuts']], axis = 1)
df_pct_bins[1] = df_pct_bins[1].str[:-1].astype(str)
df['cuts'] = df['cuts'].astype(str)
df_pct_bins = pd.merge(df, df_pct_bins, left_on= 'cuts', right_on= 1)

Consider the sample data df and s:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(cuts=[10, 20, 50]))
s = pd.Series(np.random.randint(50, size=1000))
Option 1
np.searchsorted
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.value_counts(
            c[np.searchsorted(c, s)],
            normalize=True
        )))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
Option 2
pd.cut
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.cut(
            s,
            np.append(-np.inf, c),
            labels=c
        ).value_counts(normalize=True)
    ))

   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
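
To see why np.searchsorted works as a binning step, here is a small deterministic sketch. With side="left" it returns, for each value, the index of the first cut that is greater than or equal to it, which corresponds to the right-closed bins (-inf, 10], (10, 20], (20, 50]. The sample series and the use of np.bincount to count per bin are my own choices for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cuts": [10, 20, 50]})
s = pd.Series([0, 5, 10, 11, 20, 21, 35, 50])  # made-up values, boundaries included

c = df["cuts"].to_numpy()  # assumed to be sorted ascending

# Index of the bucket each value falls into; values above the last cut get len(c).
idx = np.searchsorted(c, s.to_numpy(), side="left")

# Count per bucket; the extra slot catches values above the last cut and is dropped.
counts = np.bincount(idx, minlength=len(c) + 1)[:len(c)]

result = df.assign(pct=counts / len(s))
print(result)
```

Note that the denominator here is the full length of s, so if some values exceed the last cut the percentages will not sum to 1.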

For a pandas dataframe column, TypeError: float() argument must be a string or a number

Here is the code, where 'LoanAmount', 'ApplicantIncome' and 'CoapplicantIncome' are of type object:
document=pandas.read_csv("C:/Users/User/Documents/train_u6lujuX_CVtuZ9i.csv")
document.isnull().any()
document = document.fillna(lambda x: x.median())
for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
    document[col]=document[col].astype(float)
document['LoanAmount_log'] = np.log(document['LoanAmount'])
document['TotalIncome'] = document['ApplicantIncome'] + document['CoapplicantIncome']
document['TotalIncome_log'] = np.log(document['TotalIncome'])
I get the following error when converting the object columns to float:
TypeError: float() argument must be a string or a number
Please help, as I need to train my classification model using these features. Here's a snippet of the CSV file:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
LP001002 Male No 0 Graduate No 5849 0 360 1 Urban Y
LP001003 Male Yes 1 Graduate No 4583 1508 128 360 1 Rural N
LP001005 Male Yes 0 Graduate Yes 3000 0 66 360 1 Urban Y
LP001006 Male Yes 0 Not Graduate No 2583 2358 120 360 1 Urban Y

In your code, document = document.fillna(lambda x: x.median()) does not compute any medians: fillna treats the lambda as the fill value, so each missing cell ends up holding the function object itself. float() cannot convert a function; it only accepts numbers or numeric strings, hence the TypeError.
Hope the following code helps:
median = document['LoanAmount'].median()
document['LoanAmount'] = document['LoanAmount'].fillna(median) # Or document = document.fillna(method='ffill')
for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
    document[col]=document[col].astype(float)
document['LoanAmount_log'] = np.log(document['LoanAmount'])
document['TotalIncome'] = document['ApplicantIncome'] + document['CoapplicantIncome']
document['TotalIncome_log'] = np.log(document['TotalIncome'])
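
If more than one of these columns can contain missing values, a possible variant is to fill each column with its own median in one step by passing the Series of medians to fillna. This is a minimal sketch on a toy frame built from the first rows of the CSV snippet (with the missing LoanAmount as NaN); the variable cols is my own name.

```python
import numpy as np
import pandas as pd

# Toy frame based on the CSV snippet; the first LoanAmount is missing.
document = pd.DataFrame({
    "LoanAmount": [np.nan, 128, 66, 120],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "CoapplicantIncome": [0, 1508, 0, 2358],
})

cols = ["LoanAmount", "ApplicantIncome", "CoapplicantIncome"]

# Convert first, then fill: median() returns one value per column,
# and fillna matches those values to the columns by name.
document[cols] = document[cols].astype(float)
document[cols] = document[cols].fillna(document[cols].median())

document["LoanAmount_log"] = np.log(document["LoanAmount"])
document["TotalIncome"] = document["ApplicantIncome"] + document["CoapplicantIncome"]
document["TotalIncome_log"] = np.log(document["TotalIncome"])
print(document)
```

If the real columns contain non-numeric text, astype(float) will still fail; in that case pd.to_numeric(..., errors="coerce") may be a more forgiving conversion.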

Pandas groupby mean absolute deviation

I have a pandas dataframe like this:
Product Group  Product ID  Units Sold  Revenue  Rev/Unit
A              451         8           $16      $2
A              987         15          $40      $2.67
A              311         2           $5       $2.50
B              642         6           $18      $3.00
B              251         4           $28      $7.00
I want to transform it to look like this:
Product Group  Units Sold  Revenue  Rev/Unit  Mean Abs Deviation
A              25          $61      $2.44     $0.24
B              10          $46      $4.60     $2.00
The Mean Abs Deviation column is to be performed on the Rev/Unit column in the first table. The tricky thing is taking into account the respective weights behind the Rev/Unit calculation.
For example taking a straight MAD of Product Group A's Rev/Unit would yield $0.26. However after taking weight into consideration, the MAD would be $0.24.
I know to use groupby to get the simple summation for units sold and revenue, but I'm a bit lost on how to do the more complicated calculations of the next 2 columns.
Also while we're giving advice/help---is there any easier way to create/paste tables into SO posts??
UPDATE:
Would a solution like this work? I know it will for the summation fields, but not sure how to implement for the latter 2 fields.
grouped_df=df.groupby("Product Group")
grouped_df.agg({
'Units Sold':'sum',
'Revenue':'sum',
'Rev/Unit':'Revenue'/'Units Sold',
'MAD':some_function})

You need to clarify what the "weights" are. I assumed the weights are the number of units sold, but that gives a different result from yours:
pv = df.pivot_table( index='Product Group',
                     values=[ 'Units Sold', 'Revenue' ],
                     aggfunc=sum )
pv[ 'Rev/Unit' ] = pv.Revenue / pv[ 'Units Sold' ]
this gives:
               Revenue  Units Sold  Rev/Unit
Product Group
A                   61          25      2.44
B                   46          10      4.60
As for WMAD:
def wmad( prod ):
    idx = df[ 'Product Group' ] == prod
    w = df[ 'Units Sold' ][ idx ]
    abs_dev = np.abs ( df[ 'Rev/Unit' ][ idx ] - pv[ 'Rev/Unit' ][ prod ] )
    return sum( abs_dev * w ) / sum( w )
pv[ 'Mean Abs Deviation' ] = [ wmad( idx ) for idx in pv.index ]
which, as I mentioned, gives a different result:
               Revenue  Units Sold  Rev/Unit  Mean Abs Deviation
Product Group
A                   61          25      2.44              0.2836
B                   46          10      4.60              1.9200
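
The same calculation can also be sketched with a single groupby/apply, which returns all four columns at once. As above, this assumes the weights are the units sold; the function name summarize is my own, and the columns are taken to be numeric (no "$" prefix). Because Rev/Unit is recomputed without rounding here, group A comes out at about 0.2816 rather than the 0.2836 obtained from the rounded $2.67.

```python
import pandas as pd

df = pd.DataFrame({
    "Product Group": ["A", "A", "A", "B", "B"],
    "Product ID": [451, 987, 311, 642, 251],
    "Units Sold": [8, 15, 2, 6, 4],
    "Revenue": [16, 40, 5, 18, 28],
})
df["Rev/Unit"] = df["Revenue"] / df["Units Sold"]


def summarize(g):
    """Aggregate one product group; weights are assumed to be units sold."""
    units = g["Units Sold"].sum()
    revenue = g["Revenue"].sum()
    rate = revenue / units  # group-level revenue per unit
    # Weighted mean absolute deviation of the row-level Rev/Unit from the group rate.
    wmad = (g["Units Sold"] * (g["Rev/Unit"] - rate).abs()).sum() / units
    return pd.Series({
        "Units Sold": units,
        "Revenue": revenue,
        "Rev/Unit": rate,
        "Mean Abs Deviation": wmad,
    })


summary = (
    df.groupby("Product Group")[["Units Sold", "Revenue", "Rev/Unit"]]
      .apply(summarize)
)
print(summary)
```

apply is used instead of agg because the weighted deviation needs several columns of the group at the same time.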

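Building on your suggested solution, you can pass a lambda function for a column, e.g.:
'Rev/Unit': lambda x: calculate_revenue_per_unit(x)
Bear in mind that when a dict is passed to agg, x is the Series of that one column's values for each group, not a row, so calculate_revenue_per_unit (a placeholder name) only sees the Rev/Unit values. A calculation that needs several columns at once has to go through groupby(...).apply instead.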
