Does anybody know how to perform a window function in Apache Beam (Dataflow)?
Example:
ID Sector Country Income
1 Liam US 16133
2 Noah BR 10184
3 Oliver ITA 11119
4 Elijah FRA 13256
5 William GER 7722
6 James AUS 9786
7 Benjamin ARG 1451
8 Lucas FRA 4541
9 Henry US 9111
10 Alexander ITA 13002
11 Olivia ENG 5143
12 Emma US 18076
13 Ava MEX 15930
14 Charlotte ENG 18247
15 Sophia BR 9578
16 Amelia FRA 10813
17 Isabella FRA 7575
18 Mia GER 14875
19 Evelyn AUS 19749
20 Harper ITA 19642
Questions:
How to create another column with the running sum of the Income ordered by ID?
How to create another column with the rank of the people who earn the most?
Thank You
Bruno
Consider the approach below. I have tried my best to make sure that the ParDo fns are associative and commutative, which means this should not break when run in parallel on multiple workers. Let me know if you find this breaking on DataflowRunner.
import apache_beam as beam
from apache_beam.transforms.core import DoFn

class cum_sum(DoFn):
    # Running sum: add up the Income of every row whose ID is
    # less than or equal to this element's ID.
    def process(self, element, lkp_data, accum_sum):
        for lkp_id_income in lkp_data:
            if element['ID'] >= lkp_id_income[0]:
                accum_sum += lkp_id_income[1]
        element.update({'cumulative_sum': accum_sum})
        yield element

class rank_it(DoFn):
    # Rank: count how many rows have a smaller cumulative_sum.
    def process(self, element, lkp_data, counter):
        for lkp_id_cumsum in lkp_data:
            if lkp_id_cumsum['cumulative_sum'] < element['cumulative_sum']:
                counter += 1
        element.update({'rank': counter})
        yield element

with beam.Pipeline() as p:
    data = (
        p
        | 'create' >> beam.Create(
            [
                {'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400},
                {'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200},
                {'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300},
                {'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800},
            ]
        )
    )
    ids_income = (
        data
        | 'get_ids_income' >> beam.Map(lambda element: (element['ID'], element['Income']))
    )
    with_cumulative_sum = (
        data
        | 'cumulative_sum' >> beam.ParDo(cum_sum(), lkp_data=beam.pvalue.AsIter(ids_income), accum_sum=0)
    )
    with_ranking = (
        with_cumulative_sum
        | 'ranking' >> beam.ParDo(rank_it(), lkp_data=beam.pvalue.AsIter(with_cumulative_sum), counter=1)
        | 'print' >> beam.Map(print)
    )
Output
{'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400, 'cumulative_sum': 5700, 'rank': 4}
{'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200, 'cumulative_sum': 2500, 'rank': 2}
{'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300, 'cumulative_sum': 1300, 'rank': 1}
{'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800, 'cumulative_sum': 4300, 'rank': 3}
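Note that the rank above follows cumulative_sum. If you want the rank of who earns the most by Income itself (highest earner = rank 1), a variant of the same side-input pattern could look like this; a sketch only, not verified on DataflowRunner:

class rank_by_income(DoFn):
    # Hypothetical variant of rank_it: rank by Income descending,
    # so the highest earner gets rank 1. Uses the same AsIter side
    # input pattern as above.
    def process(self, element, lkp_data, counter):
        for other in lkp_data:
            if other['Income'] > element['Income']:
                counter += 1
        element.update({'income_rank': counter})
        yield element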
Windowing in Apache Beam subdivides your unbounded PCollection into smaller bounded chunks so that some computation (group by, sum, avg, ...) can be applied.
Unbounded PCollections come from streaming processing, and windows are based on timestamps (you can create a sliding window of 5 minutes, for instance). In your example you have no timestamps, and it sounds like a bounded PCollection (a batch).
Technically you can simulate timestamps by preprocessing the elements and adding a dummy time indicator. But in your case, a simple group-by or a sort is enough to achieve what you want.
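For completeness, here is a minimal sketch of that dummy-timestamp idea, assuming the same dict elements as in the other answer (the ID stands in for a timestamp):

import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{'ID': 1, 'Income': 1300}, {'ID': 2, 'Income': 1200}])
        # Attach a dummy timestamp derived from the ID.
        | 'stamp' >> beam.Map(lambda e: window.TimestampedValue(e['Income'], e['ID']))
        # Standard windowing now applies: sum Income per 10-"second" window.
        | 'window' >> beam.WindowInto(window.FixedWindows(10))
        | 'sum' >> beam.CombineGlobally(sum).without_defaults()
        | beam.Map(print)
    )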
Related
I have a table with a number of columns. I created a measure that counts how many days it's been since the last entry was recorded.
Location  Days since Last entry
Book      10
Hat       4
Dress     9
Shoe      2
Bag       1
I want to create a column that shows the days since the last entry by group (Red = 9+ days, Amber = 5 to 8 days, Green = fewer than 5 days).
So far I tried
NewColumn=
IF (
[DaysSinceLastEntry] >= 9, "Red",
IF([DaysSinceLastEntry] < 9 && [DaysSinceLastEntry] >5 = "Amber",)
&
IF(
[DaysSinceLastEntry] <= 5, "Green"
))
The above gives something like:
Location  Days since Last entry  Group
Book      10                     Red
Book      5                      Amber
Book      2                      Green
Hat       9                      Red
Hat       5                      Amber
Hat       2                      Green
I want:
Location  Days since Last entry  Group
Book      10                     Red
Hat       6                      Amber
Dress     9                      Red
Shoe      2                      Green
Bag       1                      Green
I can't figure out how to display the Red/Amber/Green based on the number of days since the last entry. It doesn't have to be an IF statement. Any help would be much appreciated, thank you.
Don't know if this is what you are looking for:
import pandas as pd
import plotly.graph_objs as go

# make dataframe
data = {
    'Location': ['Book', 'Hat', 'Dress', 'Shoe', 'Bag'],
    'DaysSinceLastEntry': [10, 4, 9, 2, 1],
}
df = pd.DataFrame(data)

# assign a color; thresholds follow the question
# (Red = 9+, Amber = 5 to 8, Green = under 5)
def color_filter(x):
    if x >= 9:
        return '#FF0000'  # red
    elif x >= 5:
        return '#FFBF00'  # amber
    else:
        return '#00FF00'  # green

df['Color'] = df.DaysSinceLastEntry.map(color_filter)

# plot
fig = go.Figure(
    go.Bar(x=df['Location'],
           y=df['DaysSinceLastEntry'],
           marker={'color': df['Color']})
)
fig.show()
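If what you actually need is the Group column rather than a chart, the same thresholds map directly to labels; a pandas sketch reusing df from above:

# Map the same thresholds to Red/Amber/Green labels.
def group_filter(x):
    if x >= 9:
        return 'Red'
    elif x >= 5:
        return 'Amber'
    return 'Green'

df['Group'] = df.DaysSinceLastEntry.map(group_filter)
print(df[['Location', 'DaysSinceLastEntry', 'Group']])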
I have 2 lists. The first is:
city_indices = list(range(0 , len(cities))) # There are 12 cities in that list
Its output is:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
The second list is the city names:
city_names = ['Buenos Aires',
'Toronto',
'Marakesh',
'Albuquerque',
'Los Cabos',
'Greenville',
'Archipelago Sea',
'Pyeongchang',
'Walla Walla Valley',
'Salina Island',
'Solta',
'Iguazu Falls'
]
I have to put the result of combining the two lists in a variable, names_and_ranks = []
The code I have to combine the lists is:
for index in list(range(0,len(cities))):
print(f'{city_indices[index]}' '. ', city_names[index])
Its output:
0. Buenos Aires
1. Toronto
2. Marakesh
3. Albuquerque
4. Los Cabos
5. Greenville
6. Archipelago Sea
7. Pyeongchang
8. Walla Walla Valley
9. Salina Island
10. Solta
11. Iguazu Falls
Here's where I'm stuck. I can't figure out how to start the list with 1. and end in 12 or how to put the whole thing in names_and_ranks = []
Just add 1 to city_indices[index]:
for index in list(range(0,len(city_names))):
print(f'{city_indices[index] + 1}' '. ', city_names[index])
Output:
1. Buenos Aires
2. Toronto
3. Marakesh
4. Albuquerque
5. Los Cabos
6. Greenville
7. Archipelago Sea
8. Pyeongchang
9. Walla Walla Valley
10. Salina Island
11. Solta
12. Iguazu Falls
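To also collect the strings in names_and_ranks rather than just printing them, a small sketch assuming the same city_names list:

# Build the ranked strings and store them in names_and_ranks.
names_and_ranks = [f'{index + 1}. {name}' for index, name in enumerate(city_names)]
print(names_and_ranks[0])    # 1. Buenos Aires
print(names_and_ranks[-1])   # 12. Iguazu Falls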
I would like to join two DataFrames together:
from pandas import DataFrame

left = DataFrame({'Title': ['Paris Match', 'Lancome', 'Channel'],
                  'City': ['Paris', 'Milan', 'Montpellier']})
right = DataFrame({'Title': ['Lulu', 'Channel', 'Balance', 'Paris Match', 'Shaq', 'And 1'],
                   'City': ['New york', 'Valparaiso', 'Montreal', 'Paris', 'Los Angeles', 'Brooklyn'],
                   'Price': [10, 20, 30, 40, 50, 60]})
and the result expected is:
r = DataFrame({'Title': ['Paris Match', 'Lancome', 'Channel', 'Lulu', 'Balance', 'Shaq', 'And 1'],
               'City': ['Paris', 'Milan', 'Montpellier', 'Montreal', 'Paris', 'Los Angeles', 'Brooklyn'],
               'Price': [40, 'NaN', 30, 40, 50, 60, 'NaN']})
r[['Title', 'City', 'Price']]
I'm doing result = left.join(right) and I'm getting a columns-overlap error on Title and City.
Perform an outer merge:
In [30]:
left.merge(right, how='outer')
Out[30]:
City Title Price
0 Paris Paris Match 40
1 Milan Lancome NaN
2 Montpellier Channel NaN
3 New york Lulu 10
4 Valparaiso Channel 20
5 Montreal Balance 30
6 Los Angeles Shaq 50
7 Brooklyn And 1 60
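Note that merge with no on= uses every shared column (here Title and City) as the join key, which is why 'Channel' appears twice above. If you only want to match on Title, a sketch:

# Join on Title only; the overlapping City columns get suffixes.
result = left.merge(right, on='Title', how='outer', suffixes=('_left', '_right'))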
I have a data frame containing production data for a factory. The factory is organised into lines. The structure of the data is such that one of the columns contains repeating values that, properly thought of, are headers. I need to reshape the data. So in the following DataFrame the 'Quality' column contains 4 measures, each of which is then measured for each hour. Clearly this gives us four observations per line.
The goal here is to transpose this data, but such that some of the columns are single index and some are multi index. The row index should remain ['Date', 'ID']. The single index columns should be 'line_no', 'floor', 'buyer' and the multi index columns should be the hourly measures for each of the quality measures.
I know that this is possible because I accidentally stumbled across the way to do it. Basically, as my code will show, I put everything in the index except the hourly data and then unstacked the Quality column from the index. Then, by chance, I tried to reset the index and it created this amazing dataframe where some columns were single index and some multi. Of course it's highly impractical to have loads of columns in the index, because we might want to do stuff with them, like change them. My question is how to achieve this type of thing without having to go through this (what I feel is a) workaround.
import random
import pandas as pd

d = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'Date': ['2013-05-04' for x in range(12)] + \
             ['2013-05-06' for x in range(12)],
     'line_no': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'floor': [5, 5, 5, 5, 6, 6, 6, 6, 5, 5, 5, 5] * 2,
     'buyer': ['buyer1', 'buyer1', 'buyer1', 'buyer1',
               'buyer2', 'buyer2', 'buyer2', 'buyer2',
               'buyer1', 'buyer1', 'buyer1', 'buyer1'] * 2,
     'Quality': ['no_checked', 'good', 'alter', 'rejected'] * 6,
     'Hour1': [random.randint(1000, 15000) for x in range(24)],
     'Hour2': [random.randint(1000, 15000) for x in range(24)],
     'Hour3': [random.randint(1000, 15000) for x in range(24)],
     'Hour4': [random.randint(1000, 15000) for x in range(24)],
     'Hour5': [random.randint(1000, 15000) for x in range(24)],
     'Hour6': [random.randint(1000, 15000) for x in range(24)]}

DF = pd.DataFrame(d, columns=['ID', 'Date', 'line_no', 'floor', 'buyer',
                              'Quality', 'Hour1', 'Hour2', 'Hour3', 'Hour4',
                              'Hour5', 'Hour6'])
DF.set_index(['Date', 'ID'])
So this is how I achieved what I wanted, but there must be a way to do this without having to go through all these steps. Help please...
# Reset the index
DF.reset_index(inplace = True)
# Put everything in the index
DF.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality'], inplace = True)
# Unstack Quality
DFS = DF.unstack('Quality')
# Now this was the accidental workaround - gives exactly the result I want
DFS.reset_index(inplace = True)
DFS.set_index(['Date', 'ID'], inplace = True)
All help appreciated. Sorry for the long question, but at least there is some data riiiight!
In general, inplace operations are not faster and are IMHO less readable.
In [18]: df.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality']).unstack('Quality').reset_index(['line_no', 'floor', 'buyer'])
Out[18]:
line_no floor buyer Hour1 Hour2 Hour3 Hour4 Hour5 Hour6
Quality alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected
Date ID
2013-05-04 1 1 5 buyer1 6920 8681 9317 14631 5739 2112 4211 12026 13577 1855 13884 12710 7250 2540 1948 7116 9874 7302 10961 8251 3070 2793 14293 10895
2 2 6 buyer2 7943 7501 13725 1648 7178 9670 6278 6888 9969 11766 9968 4722 7242 4049 6704 2225 6546 8688 11513 14550 2140 11941 1142 6683
3 3 5 buyer1 5155 2449 13648 2183 14184 7309 1185 10454 11742 14102 2242 14297 6185 5554 12505 13312 3062 7426 4421 5693 12342 11622 10431 13375
2013-05-06 1 1 5 buyer1 14563 1343 14419 3350 8526 1185 5244 14777 2238 3640 6717 1109 7777 13136 1732 8681 14454 1059 10606 6942 9349 4524 13931 11799
2 2 6 buyer2 14837 9524 8453 6074 11516 12356 9651 10650 15000 11374 4690 10914 1857 3231 14627 6590 6503 9268 13108 8581 8448 12013 14175 10783
3 3 5 buyer1 9032 12959 4613 6793 7918 2827 6027 13002 11771 13370 12767 11080 12624 13269 11740 10543 8609 14709 11921 12484 8670 12706 8001 8991
[6 rows x 27 columns]
is quite a reasonable idiom for what you are doing.
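On newer pandas (>= 1.1, where pivot accepts a list of index columns), roughly the same shape can be built in one chain; a sketch assuming DF as defined in the question:

# Pivot on Quality, then move line_no/floor/buyer back out of the
# row index so they end up as single-level columns.
out = (DF.pivot(index=['Date', 'ID', 'line_no', 'floor', 'buyer'],
                columns='Quality')
         .reset_index(['line_no', 'floor', 'buyer']))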
How would you create a column (or columns) in the below pandas DataFrame where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time-series data, where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
import pandas as pd

left4 = pd.DataFrame({'ID': [1, 2, 3, 4], 'val': [10000, 25000, 20000, 40000],
                      'Mod_ID': [15, 35, 15, 42], 'car': ['ford', 'honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3, 1, 2, 4], 'color': ['red', 'green', 'blue', 'grey'],
                       'wheel': ['4wheel', '4wheel', '2wheel', '2wheel'],
                       'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1[["Mod_ID_x","val"]].groupby("Mod_ID_x").transform(pd.expanding_mean)
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000
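pd.expanding_mean has since been removed from pandas; on current versions the equivalent is a groupby plus transform with expanding(). A sketch, assuming df1 as built above:

# Expanding mean (and median) of 'val' within each Mod_ID_x group.
df1['exp_mean'] = df1.groupby('Mod_ID_x')['val'].transform(lambda s: s.expanding().mean())
df1['exp_median'] = df1.groupby('Mod_ID_x')['val'].transform(lambda s: s.expanding().median())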