How do you calculate expanding mean on time series using pandas? - python-2.7

How would you create one or more columns in the pandas DataFrame below where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time series data, where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
import pandas as pd

left4 = pd.DataFrame({'ID': [1, 2, 3, 4], 'val': [10000, 25000, 20000, 40000],
                      'Mod_ID': [15, 35, 15, 42], 'car': ['ford', 'honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3, 1, 2, 4], 'color': ['red', 'green', 'blue', 'grey'],
                       'wheel': ['4wheel', '4wheel', '2wheel', '2wheel'], 'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)

Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1[["Mod_ID_x","val"]].groupby("Mod_ID_x").transform(pd.expanding_mean)
>>> df1
   ID  Mod_ID_x    car    val  color   wheel  exp_mean
0   1        15   ford  10000  green  4wheel     10000
1   2        35  honda  25000   blue  2wheel     25000
2   3        15   ford  20000    red  4wheel     15000
3   4        42  lexus  40000   grey  2wheel     40000
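Note that the top-level pd.expanding_mean helper was deprecated and later removed from pandas, so on a newer version the same per-group expanding mean has to be written with the groupby/expanding API. A minimal sketch, assuming df1 as built above and rows already in chronological order:
# modern equivalent of transform(pd.expanding_mean), grouped by Mod_ID_x
df1["exp_mean"] = (df1.groupby("Mod_ID_x")["val"]
                      .expanding().mean()
                      .reset_index(level=0, drop=True))
The reset_index(level=0, drop=True) step removes the group level that expanding() adds, so the result aligns back to df1's original row index.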

Related

Creating an If statement with multiple conditions in Power BI

I have a table with a number of columns. I created a measure that counts how many days it's been since the last entry was recorded.
Location    Days since Last entry
Book        10
Hat         4
Dress       9
Shoe        2
Bag         1
I want to create a column that shows the days since the last entry by group (Red = 9+ days, Amber = between 5 and 9 days, Green = less than 4 days).
So far I tried:
NewColumn =
IF (
    [DaysSinceLastEntry] >= 9, "Red",
    IF([DaysSinceLastEntry] < 9 && [DaysSinceLastEntry] > 5 = "Amber",)
    &
    IF(
        [DaysSinceLastEntry] <= 5, "Green"
    ))
The above gives something like:
Location    Days since Last entry    Group
Book        10                       Red
Book        5                        Amber
Book        2                        Green
Hat         9                        Red
Hat         5                        Amber
Hat         2                        Green
I want:
Location    Days since Last entry    Group
Book        10                       Red
Hat         6                        Amber
Dress       9                        Red
Shoe        2                        Green
Bag         1                        Green
I can't figure out how to display the Red/Amber/Green based on the number of days since the last entry. It doesn't have to be an IF statement. Any help would be much appreciated, thank you.
Don't know if this is what you are looking for:
import pandas as pd
import plotly.graph_objs as go

# make dataframe
data = {
    'Location': ['Book', 'Hat', 'Dress', 'Shoe', 'Bag'],
    'DaysSinceLastEntry': [10, 4, 9, 2, 1],
}
df = pd.DataFrame(data)

# assign color
def color_filter(x):
    if x <= 5:
        return '#00FF00'  # green
    elif 5 < x <= 9:
        return '#FFBF00'  # amber
    else:
        return '#FF0000'  # red

df['Color'] = df.DaysSinceLastEntry.map(lambda x: color_filter(x))

# plot
fig = go.Figure(
    go.Bar(x=df['Location'],
           y=df['DaysSinceLastEntry'],
           marker={'color': df['Color']})
)
fig.show()
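If what is actually needed is the Red/Amber/Green label itself rather than a chart, a minimal pandas sketch along the same lines (reusing the df built above; the cut-offs below mirror the desired output, where 9+ days is Red):
def group_label(days):
    if days >= 9:
        return 'Red'     # 9 or more days
    elif days > 5:
        return 'Amber'   # 6 to 8 days
    return 'Green'       # 5 days or fewer

df['Group'] = df['DaysSinceLastEntry'].map(group_label)
print(df[['Location', 'DaysSinceLastEntry', 'Group']])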

Window Functions in Apache Beam

Does anybody know how to perform a window function in Apache Beam (Dataflow)?
Example:
ID Sector Country Income
1 Liam US 16133
2 Noah BR 10184
3 Oliver ITA 11119
4 Elijah FRA 13256
5 William GER 7722
6 James AUS 9786
7 Benjamin ARG 1451
8 Lucas FRA 4541
9 Henry US 9111
10 Alexander ITA 13002
11 Olivia ENG 5143
12 Emma US 18076
13 Ava MEX 15930
14 Charlotte ENG 18247
15 Sophia BR 9578
16 Amelia FRA 10813
17 Isabella FRA 7575
18 Mia GER 14875
19 Evelyn AUS 19749
20 Harper ITA 19642
Questions:
How to create another column with the running sum of the Income ordered by ID?
How to create another column with the rank of the people who earn the most?
Thank You
Bruno
Consider the approach below. I have tried my best to make sure that the ParDo fns are associative and commutative, which means this should not break when run in parallel on multiple workers. Let me know if you find it breaking on the DataflowRunner.
import apache_beam as beam
from apache_beam.transforms.core import DoFn

class cum_sum(DoFn):
    def process(self, element, lkp_data, accum_sum):
        for lkp_id_income in lkp_data:
            if element['ID'] >= lkp_id_income[0]:
                accum_sum += lkp_id_income[1]
        element.update({'cumulative_sum': accum_sum})
        yield element

class rank_it(DoFn):
    def process(self, element, lkp_data, counter):
        for lkp_id_cumsum in lkp_data:
            if lkp_id_cumsum['cumulative_sum'] < element['cumulative_sum']:
                counter += 1
        element.update({'rank': counter})
        yield element

with beam.Pipeline() as p:
    data = (
        p
        | 'create' >> beam.Create(
            [
                {'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400},
                {'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200},
                {'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300},
                {'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800},
            ]
        )
    )
    ids_income = (
        data
        | 'get_ids_income' >> beam.Map(lambda element: (element['ID'], element['Income']))
    )
    with_cumulative_sum = (
        data
        | 'cumulative_sum' >> beam.ParDo(cum_sum(),
                                         lkp_data=beam.pvalue.AsIter(ids_income),
                                         accum_sum=0)
    )
    with_ranking = (
        with_cumulative_sum
        | 'ranking' >> beam.ParDo(rank_it(),
                                  lkp_data=beam.pvalue.AsIter(with_cumulative_sum),
                                  counter=1)
        | 'print' >> beam.Map(print)
    )
Output
{'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400, 'cumulative_sum': 5700, 'rank': 4}
{'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200, 'cumulative_sum': 2500, 'rank': 2}
{'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300, 'cumulative_sum': 1300, 'rank': 1}
{'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800, 'cumulative_sum': 4300, 'rank': 3}
Windowing in Apache Beam subdivides your unbounded PCollection into smaller bounded chunks so that some computation (group by, sum, avg, ...) can be applied to each chunk.
Unbounded PCollections come from streaming sources, and windows are based on timestamps (you can create a sliding window of 5 minutes, for instance). In your example there are no timestamps, and it sounds like a bounded PCollection (a batch).
Technically you can simulate timestamps by preprocessing the elements and adding a dummy time indicator, as sketched below. But in your case a simple group-by or a sort is enough to achieve what you want.
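A minimal sketch of that dummy-timestamp idea, in case windowing really is wanted (the field names mirror the sample data; deriving the event time from the ID and using 60-second fixed windows are arbitrary choices for illustration):
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (
        p
        | 'create' >> beam.Create([{'ID': 1, 'Income': 1300},
                                   {'ID': 2, 'Income': 1200},
                                   {'ID': 3, 'Income': 1800},
                                   {'ID': 4, 'Income': 1400}])
        # attach an artificial event time derived from the ID
        | 'stamp' >> beam.Map(lambda e: TimestampedValue(e, e['ID']))
        | 'window' >> beam.WindowInto(FixedWindows(60))  # 60-second fixed windows
        | 'income' >> beam.Map(lambda e: e['Income'])
        # the sum is now computed per window rather than over the whole PCollection
        | 'sum_per_window' >> beam.CombineGlobally(sum).without_defaults()
        | 'print' >> beam.Map(print)
    )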

Create list from pandas dataframe

I have a function that takes every (non-distinct) MatchId, together with the paired (xG_Team1, xG_Team2) values, and returns an array; these are then summed up to give an SSE constant.
The problem with the function is that it iterates through each row, so MatchIds are duplicated. I want to stop this.
For each distinct MatchId I need the corresponding home and away goal times as lists, i.e. Home_Goal and Away_Goal to be used in each iteration, taken from the Home_Goal_Time and Away_Goal_Time columns of the dataframe. The lists below don't seem to work.
MatchId Event_Id EventCode Team1 Team2 Team1_Goals
0 842079 2053 Goal Away Huachipato Cobresal 0
1 842079 2053 Goal Away Huachipato Cobresal 0
2 842080 1029 Goal Home Slovan lava 3
3 842080 1029 Goal Home Slovan lava 3
4 842080 2053 Goal Away Slovan lava 3
5 842080 1029 Goal Home Slovan lava 3
6 842634 2053 Goal Away Rosario Boca Juniors 0
7 842634 2053 Goal Away Rosario Boca Juniors 0
8 842634 2053 Goal Away Rosario Boca Juniors 0
9 842634 2054 Cancel Goal Away Rosario Boca Juniors 0
Team2_Goals xG_Team1 xG_Team2 CurrentPlaytime Home_Goal_Time Away_Goal_Time
0 2 1.79907 1.19893 2616183 0 87
1 2 1.79907 1.19893 3436780 0 115
2 1 1.70662 1.1995 3630545 121 0
3 1 1.70662 1.1995 4769519 159 0
4 1 1.70662 1.1995 5057143 0 169
5 1 1.70662 1.1995 5236213 175 0
6 2 0.82058 1.3465 2102264 0 70
7 2 0.82058 1.3465 4255871 0 142
8 2 0.82058 1.3465 5266652 0 176
9 2 0.82058 1.3465 5273611 0 0
For example MatchId = 842079, Home_goal =[], Away_Goal = [87, 115]
x1 = [1,0,0]
x2 = [0,1,0]
x3 = [0,0,1]
m = 1  # arbitrary constant used to optimise the SSE
k = 196
total_timeslot = 196
Home_Goal = [] # No Goal
Away_Goal = [] # No Goal
import numpy as np

def sum_squared_diff(x1, x2, x3, y):
    ssd = []
    for k in range(total_timeslot):  # k will take multiple values
        if k in Home_Goal:
            ssd.append(sum((x2 - y) ** 2))
        elif k in Away_Goal:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return ssd

def my_function(row):
    xG_Team1 = row.xG_Team1
    xG_Team2 = row.xG_Team2
    return np.array([1 - (xG_Team1*m + xG_Team2*m)/k, xG_Team1*m/k, xG_Team2*m/k])

results = df.apply(lambda row: sum_squared_diff(x1, x2, x3, my_function(row)), axis=1)
results
sum(results.sum())
For the three matches above, the desired outcome should look like the following.
If I need an individual SSE, sum(sum_squared_diff(x1, x2, x3, y)) gives me the following:
MatchId 842079: 3.984053038520635
MatchId 842080: 7.882189570700502
MatchId 842634: 5.929085973050213
Given the size of the original data, realistically I am after the total sum of the SSEs. For the above sample data, simply adding up the values gives total sse = 17.79532858227135. Once I achieve this, I will then try to optimise the SSE based on this figure by updating the arbitrary value m.
Here are the lists I hoped the function would iterate over.
Home_scored = s.groupby('MatchId')['Home_Goal_time'].apply(list)
Away_scored = s.groupby('MatchId')['Away_Goal_Time'].apply(list)
type(Home_scored)
pandas.core.series.Series
Then convert it to lists.
Home_Goal = Home_scored.tolist()
Away_Goal = Away_scored.tolist()
type(Home_Goal)
list
Home_Goal
Out[303]: [[0, 0], [121, 159, 0, 175], [0, 0, 0, 0]]
Away_Goal
Out[304]: [[87, 115], [0, 0, 169, 0], [70, 142, 176, 0]]
But the function still takes Home_Goal and Away_Goal as empty lists.
If you only want to consider one MatchId at a time, you should .groupby('MatchId') first:
df.groupby('MatchId').apply(...)
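As a sketch of that idea (assuming df is the frame shown above, that a time of 0 in Home_Goal_Time / Away_Goal_Time means no goal, and reusing the question's x1, x2 and x3), the per-match goal lists can be built inside the applied function instead of relying on the module-level Home_Goal / Away_Goal lists:
import numpy as np

def match_sse(group, x1, x2, x3, m=1, k=196, total_timeslot=196):
    # per-match goal-time lists; 0 is treated as "no goal" here
    home_goals = [t for t in group['Home_Goal_Time'] if t > 0]
    away_goals = [t for t in group['Away_Goal_Time'] if t > 0]
    xg1, xg2 = group['xG_Team1'].iloc[0], group['xG_Team2'].iloc[0]
    y = np.array([1 - (xg1*m + xg2*m)/k, xg1*m/k, xg2*m/k])
    ssd = 0.0
    for t in range(total_timeslot):
        if t in home_goals:
            ssd += sum((np.array(x2) - y) ** 2)
        elif t in away_goals:
            ssd += sum((np.array(x3) - y) ** 2)
        else:
            ssd += sum((np.array(x1) - y) ** 2)
    return ssd

per_match = df.groupby('MatchId').apply(match_sse, x1, x2, x3)  # one SSE per MatchId
total_sse = per_match.sum()
For the sample above this should reproduce the three per-match figures listed in the question, and total_sse is then the single number that m could later be tuned against.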

Concatenate pandas dataframe with result of apply(lambda) where lambda returns another dataframe

A dataframe stores some values in columns; passing those values to a function, I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it did not work with error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did it like this:
import pandas as pd

def xy(x, y):
    return pd.DataFrame({'x': [x*2], 'y': [y*2]})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))

df2 = pd.DataFrame()
for _, row in df1.iterrows():
    nr = xy(row['cid'], row['id'])
    nr['cid'] = row['cid']
    nr['id'] = row['id']
    df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
   cid  id
0    4   6
1    4  10
df2:
   x   y  cid  id
0  8  12    4   6
1  8  20    4  10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do this properly and efficiently?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
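If the real function genuinely has to return a one-row DataFrame per input row, so that none of the shortcuts above apply, a sketch that at least avoids growing a DataFrame inside the loop (reusing the xy and df1 defined in the question) is to collect the pieces and concatenate once:
parts = [xy(row.cid, row.id) for row in df1.itertuples(index=False)]
new_cols = pd.concat(parts, ignore_index=True)                    # stack the per-row results
df2 = pd.concat([df1.reset_index(drop=True), new_cols], axis=1)   # columns: cid, id, x, y
A single pd.concat at the end is much cheaper than repeatedly calling DataFrame.append inside a loop.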

Pandas unstack but only create multi index for certain columns

I have a data frame of production data for a factory. The factory is organised into lines. The structure of the data is such that one of the columns contains repeating values that, properly thought of, are headers. I need to reshape the data. So in the following DataFrame the 'Quality' column contains 4 measures, each of which is then measured for every hour. Clearly this gives us four observations per line.
The goal here is to transpose this data such that some of the columns are single-index and some are multi-index. The row index should remain ['Date', 'ID']. The single-index columns should be 'line_no', 'floor' and 'buyer', and the multi-index columns should be the hourly measures for each of the quality measures.
I know that this is possible because I accidentally stumbled across the way to do it. Basically, as my code will show, I put everything in the index except the hourly data and then unstacked the 'Quality' column from the index. Then, by chance, I tried to reset the index and it created this amazing dataframe where some columns were single-index and some multi-index. Of course it's highly impractical to have loads of columns in the index, because we might want to do things with them, like change them. My question is how to achieve this without having to go through what I feel is a workaround.
import random
import pandas as pd

d = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'Date': ['2013-05-04' for x in xrange(12)] +
             ['2013-05-06' for x in xrange(12)],
     'line_no': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'floor': [5, 5, 5, 5, 6, 6, 6, 6, 5, 5, 5, 5] * 2,
     'buyer': ['buyer1', 'buyer1', 'buyer1', 'buyer1',
               'buyer2', 'buyer2', 'buyer2', 'buyer2',
               'buyer1', 'buyer1', 'buyer1', 'buyer1'] * 2,
     'Quality': ['no_checked', 'good', 'alter', 'rejected'] * 6,
     'Hour1': [random.randint(1000, 15000) for x in xrange(24)],
     'Hour2': [random.randint(1000, 15000) for x in xrange(24)],
     'Hour3': [random.randint(1000, 15000) for x in xrange(24)],
     'Hour4': [random.randint(1000, 15000) for x in xrange(24)],
     'Hour5': [random.randint(1000, 15000) for x in xrange(24)],
     'Hour6': [random.randint(1000, 15000) for x in xrange(24)]}

DF = pd.DataFrame(d, columns=['ID', 'Date', 'line_no', 'floor', 'buyer',
                              'Quality', 'Hour1', 'Hour2', 'Hour3', 'Hour4',
                              'Hour5', 'Hour6'])
DF.set_index(['Date', 'ID'])
So this is how I achieved what I wanted, but there must be a way to do this without having to go through all these steps. Help please...
# Reset the index
DF.reset_index(inplace = True)
# Put everything in the index
DF.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality'], inplace = True)
# Unstack Quality
DFS = DF.unstack('Quality')
#Now this was the accidental workaround - gives exactly the result I want
DFS.reset_index(inplace = True)
DFS.set_index(['Date', 'ID'], inplace = True)
All help appreciated. Sorry for the long question, but at least there is some data riiiight!
In general, inplace operations are not faster and are, IMHO, less readable.
In [18]: df.set_index(['Date','ID','Quality']).unstack('Quality')
Out[18]:
line_no floor buyer Hour1 Hour2 Hour3 Hour4 Hour5 Hour6
Quality alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected
Date ID
2013-05-04 1 1 5 buyer1 6920 8681 9317 14631 5739 2112 4211 12026 13577 1855 13884 12710 7250 2540 1948 7116 9874 7302 10961 8251 3070 2793 14293 10895
2 2 6 buyer2 7943 7501 13725 1648 7178 9670 6278 6888 9969 11766 9968 4722 7242 4049 6704 2225 6546 8688 11513 14550 2140 11941 1142 6683
3 3 5 buyer1 5155 2449 13648 2183 14184 7309 1185 10454 11742 14102 2242 14297 6185 5554 12505 13312 3062 7426 4421 5693 12342 11622 10431 13375
2013-05-06 1 1 5 buyer1 14563 1343 14419 3350 8526 1185 5244 14777 2238 3640 6717 1109 7777 13136 1732 8681 14454 1059 10606 6942 9349 4524 13931 11799
2 2 6 buyer2 14837 9524 8453 6074 11516 12356 9651 10650 15000 11374 4690 10914 1857 3231 14627 6590 6503 9268 13108 8581 8448 12013 14175 10783
3 3 5 buyer1 9032 12959 4613 6793 7918 2827 6027 13002 11771 13370 12767 11080 12624 13269 11740 10543 8609 14709 11921 12484 8670 12706 8001 8991
[6 rows x 27 columns]
is quite a reasonable idiom for what you are doing.
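For the full target shape, with 'line_no', 'floor' and 'buyer' as single-level columns next to the multi-level hourly columns and only ['Date', 'ID'] left in the row index, a sketch of the question's workaround as one chain, without inplace steps, could look like this (assuming DF as built above; col_level=0 keeps the restored columns on the first column level):
DFS = (DF.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality'])
         .unstack('Quality')
         .reset_index(['line_no', 'floor', 'buyer'], col_level=0))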