pandas: create a left join and add rows - python-2.7

I would like to join two DataFrames together:
from pandas import DataFrame

left = DataFrame({'Title': ['Paris Match', 'Lancome', 'Channel'],
                  'City': ['Paris', 'Milan', 'Montpellier']})
right = DataFrame({'Title': ['Lulu', 'Channel', 'Balance', 'Paris Match', 'Shaq', 'And 1'],
                   'City': ['New york', 'Valparaiso', 'Montreal', 'Paris', 'Los Angeles', 'Brooklyn'],
                   'Price': [10, 20, 30, 40, 50, 60]})
and the expected result is:
r = DataFrame({'Title': ['Paris Match', 'Lancome', 'Channel', 'Lulu', 'Balance', 'Shaq', 'And 1'],
               'City': ['Paris', 'Milan', 'Montpellier', 'Montreal', 'Paris', 'Los Angeles', 'Brooklyn'],
               'Price': [40, 'NaN', 30, 40, 50, 60, 'NaN']})
r[['Title', 'City', 'Price']]
I'm doing result = left.join(right) and I'm getting a columns-overlap error on Title and City.

Perform an outer merge:
In [30]:
left.merge(right, how='outer')
Out[30]:
          City        Title  Price
0        Paris  Paris Match     40
1        Milan      Lancome    NaN
2  Montpellier      Channel    NaN
3     New york         Lulu     10
4   Valparaiso      Channel     20
5     Montreal      Balance     30
6  Los Angeles         Shaq     50
7     Brooklyn        And 1     60
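If you also want the columns in the Title, City, Price order shown in the expected result, a small follow-up (using only the frames defined above) is to select them explicitly after the merge:
result = left.merge(right, how='outer')
result = result[['Title', 'City', 'Price']]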

Related

Window Functions in Apache Beam

Does anybody know how to perform a window function in Apache Beam (Dataflow)?
Example:
ID  Sector     Country  Income
 1  Liam       US        16133
 2  Noah       BR        10184
 3  Oliver     ITA       11119
 4  Elijah     FRA       13256
 5  William    GER        7722
 6  James      AUS        9786
 7  Benjamin   ARG        1451
 8  Lucas      FRA        4541
 9  Henry      US         9111
10  Alexander  ITA       13002
11  Olivia     ENG        5143
12  Emma       US        18076
13  Ava        MEX       15930
14  Charlotte  ENG       18247
15  Sophia     BR         9578
16  Amelia     FRA       10813
17  Isabella   FRA        7575
18  Mia        GER       14875
19  Evelyn     AUS       19749
20  Harper     ITA       19642
Questions:
How to create another column with the running sum of the Income ordered by ID?
How to create another column with the Rank of the people who earn the most?
Thank You
Bruno
Consider the approach below. I have tried my best to make sure that the ParDo fns are associative and commutative, which means this should not break when run in parallel on multiple workers. Let me know if you find it breaking on the DataflowRunner.
import apache_beam as beam
from apache_beam.transforms.core import DoFn

class cum_sum(DoFn):
    def process(self, element, lkp_data, accum_sum):
        for lkp_id_income in lkp_data:
            if element['ID'] >= lkp_id_income[0]:
                accum_sum += lkp_id_income[1]
        element.update({'cumulative_sum': accum_sum})
        yield element

class rank_it(DoFn):
    def process(self, element, lkp_data, counter):
        for lkp_id_cumsum in lkp_data:
            if lkp_id_cumsum['cumulative_sum'] < element['cumulative_sum']:
                counter += 1
        element.update({'rank': counter})
        yield element

with beam.Pipeline() as p:
    data = (
        p
        | 'create' >> beam.Create(
            [
                {'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400},
                {'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200},
                {'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300},
                {'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800},
            ]
        )
    )
    ids_income = (
        data
        | 'get_ids_income' >> beam.Map(lambda element: (element['ID'], element['Income']))
    )
    with_cumulative_sum = (
        data
        | 'cumulative_sum' >> beam.ParDo(cum_sum(), lkp_data=beam.pvalue.AsIter(ids_income), accum_sum=0)
    )
    with_ranking = (
        with_cumulative_sum
        | 'ranking' >> beam.ParDo(rank_it(), lkp_data=beam.pvalue.AsIter(with_cumulative_sum), counter=1)
        | 'print' >> beam.Map(print)
    )
Output
{'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400, 'cumulative_sum': 5700, 'rank': 4}
{'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200, 'cumulative_sum': 2500, 'rank': 2}
{'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300, 'cumulative_sum': 1300, 'rank': 1}
{'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800, 'cumulative_sum': 4300, 'rank': 3}
Windowing in Apache Beam subdivides your unbounded PCollection into smaller bounded chunks so that some computation (group by, sum, avg, ...) can be applied to them.
Unbounded PCollections come from streaming processing, and windows are based on timestamps (you can create a sliding window of 5 minutes, for instance). In your example you don't have timestamps, and it sounds like a bounded PCollection (a batch).
Technically you could simulate timestamps by preprocessing the elements and adding a dummy time indicator, but in your case a simple group-by or a sort is enough to achieve what you want.
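For completeness, here is a minimal non-Beam sketch of that sort-based approach on a small in-memory batch (plain Python; the rows are shaped like the example data but the values are just illustrative):
# Running sum of Income ordered by ID, and rank of who earns the most.
rows = [
    {'ID': 1, 'Sector': 'Liam', 'Country': 'US', 'Income': 16133},
    {'ID': 2, 'Sector': 'Noah', 'Country': 'BR', 'Income': 10184},
    {'ID': 3, 'Sector': 'Oliver', 'Country': 'ITA', 'Income': 11119},
]

running = 0
for row in sorted(rows, key=lambda r: r['ID']):        # cumulative sum ordered by ID
    running += row['Income']
    row['cumulative_sum'] = running

by_income = sorted(rows, key=lambda r: r['Income'], reverse=True)
for rank, row in enumerate(by_income, start=1):        # rank 1 = highest Income
    row['rank'] = rank

print(rows)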

Finding duplicate values based on condition

Below is the sample data:
1 ,ASIF JAVED IQBAL JAVED,JAVED IQBAL SO INAYATHULLAH,20170103
2 ,SYED MUSTZAR ALI MUHAMMAD ILYAS SHAH,MUHAMMAD SAFEER SO SAGHEER KHAN,20170127
3 ,AHSUN SABIR SABIR ALI,MISBAH NAVEED DO NAVEED ANJUM,20170116
4 ,RASHAD IQBAL PARVAIZ IQBAL,PERVAIZ IQBAL SO GUL HUSSAIN KHAN,20170104
5 ,RASHID ALI MUGHERI ABDUL RASOOL MUGHERI,MUMTAZ ALI BOHIO,20170105
6 ,FAKHAR IMAM AHMAD ALI,MOHAMMAD AKHLAQ ASHIQ HUSSAIN,20170105
7 ,AQEEL SARWAR MUHAMMAD SARWAR BUTT,BUSHRA WAHID,20170106
8 ,SHAFAQAT ALI REHMAT ALI,SAJIDA BIBI WO MUHAMMAD ASHRAF,20170106
9 ,MUHAMMAD ISMAIL SHAFQAT HUSSAIN,USAMA IQBAL,20170103
10 ,SULEMAN ALI KACHI GHULAM ALI,MUHAMMAD SHARIF ALLAH WARAYO,20170109
The 1st field is the serial #, the 2nd is the sender, the 3rd is the receiver, and the 4th is the date, and this data goes on for about a million rows.
Now, I want to find cases where the same sender sends a parcel to the same receiver on the same date.
I wrote the following basic code for this, but it's very slow.
import csv
from fuzzywuzzy import fuzz

serial = []
agency = []
rem_name = []
rem_name2 = []
date = []

with open('janCSV.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        serial.append(row[0])
        rem_name.append(row[2])
        rem_name2.append(row[2])
        date.append(row[4])

with open('output.csv', 'w') as out:
    for rem1 in rem_name:
        date1 = date[rem_name.index(rem1)]
        serial1 = serial[rem_name.index(rem1)]
        for rem2 in rem_name2:
            date2 = date[rem_name2.index(rem2)]
            if date1 == date2:
                ratio = fuzz.ratio(rem1, rem2)
                if ratio >= 90 and ratio < 100:
                    print serial1, rem1, rem2, date1, date2, ratio
                    out.write(str(serial1) + ',' + str(date1) + ',' + str(date2) + ',' + str(rem1) + ',' + str(rem2) + ','
                              + str(ratio) + '\n')
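One way to cut the runtime is to bucket the rows by date first, so fuzz.ratio only runs on pairs that share a date, and to drop the repeated list.index lookups. A rough sketch, keeping the original column indices (row[2] for the name, row[4] for the date), which may need adjusting to the real CSV layout:
import csv
from collections import defaultdict

from fuzzywuzzy import fuzz

# Bucket rows by date so only same-date pairs are ever compared.
rows_by_date = defaultdict(list)
with open('janCSV.csv') as f:
    for row in csv.reader(f):
        rows_by_date[row[4]].append(row)

with open('output.csv', 'w') as out:
    for day, rows in rows_by_date.items():
        for i, row1 in enumerate(rows):
            for row2 in rows[i + 1:]:   # each unordered pair once, never a row against itself
                ratio = fuzz.ratio(row1[2], row2[2])
                if 90 <= ratio < 100:
                    out.write(','.join([row1[0], day, day, row1[2], row2[2], str(ratio)]) + '\n')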

Pandas DataFrame column getting implicitly converted to NaN after merge

Below are small samples of my 2 pandas dataframes:
In [65]: df1
Out[65]:
Send_Agent Send_Amount Country_Code
0 AWD120279 85.99 KW
1 API185805 22.98 PH
2 ANO080012 490.00 NO
3 AUK359401 616.16 GB
4 ACL000105 193.78 CL
In [44]: df2
Out[44]:
Country_Code Rating
0 KW Medium
1 PH Higher
2 NO Lower
3 GB Lower
4 CL Lower
In [97]: df4 = df1[df1['Send_Agent']=='AWD120279']
In [98]: df4
Out[98]:
Send_Agent Send_Amount Country_Code
0 AWD120279 85.99 KW
3359 AWD120279 200.00 KW
3878 AWD120279 203.03 KW
In [102]: df5 = df2[df2['Country_Code']=='KW']
In [104]: df5
Out[104]:
Country_Code Rating
15 KW Medium
In [105]: pd.merge(df4,df5,on='Country_Code',how='left')
Out[105]:
Send_Agent Send_Amount Country_Code Rating
0 AWD120279 85.99 KW NaN
1 AWD120279 200.00 KW NaN
2 AWD120279 203.03 KW NaN
I am not able to figure out why the 'Rating' column is getting converted to NaN after the merge. Every Country_Code has a rating associated with it, so it should never be NaN.
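A quick sanity check in a situation like this is whether the Country_Code values really compare equal; stray whitespace or mismatched types in the key column are common reasons a left merge returns NaN even though the code appears in both frames. A minimal diagnostic sketch, assuming the df4/df5 above (the whitespace cause is an assumption, not something shown in the post):
# Do the key values match character for character?
print(df4['Country_Code'].unique())
print(df5['Country_Code'].unique())

# Strip whitespace from both key columns and retry the merge.
clean4 = df4.assign(Country_Code=df4['Country_Code'].str.strip())
clean5 = df5.assign(Country_Code=df5['Country_Code'].str.strip())
print(pd.merge(clean4, clean5, on='Country_Code', how='left'))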

R - How do I document the number of grepl matches based on another data frame?

This is a rather tricky question indeed. It would be awesome if someone could help me out.
What I'm trying to do is the following. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows). Let's call it NewHampshire.df:
Municipality County Population
1 Acworth Sullivan 891
2 Albany Carroll 735
3 Alexandria Grafton 1613
4 Allenstown Merrimack 4322
5 Alstead Cheshire 1937
6 Alton Belknap 5250
7 Amherst Hillsborough 11201
8 Andover Merrimack 2371
9 Antrim Hillsborough 2637
10 Ashland Grafton 2076
I've further compiled a new variable called grep_term, which combines the values from Municipality and County into a new variable that functions as an or-statement, something like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
I want to do two things:
1: Grepl through every observation in the location.df dataset, and save a TRUE or FALSE into a new variable depending on whether the self-disclosed location is part of the list in the first dataset.
2: Save the number of matches for a particular line in the NewHampshire.df dataset to a new variable. I.e., if there are 4 matches for Acworth in the twitter location dataset, there should be a value "4" for observation 1 in the NewHampshire.df on the newly created "matches" variable
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
  location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
                      sep = "|"),
                collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
                                 function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2

How do you calculate expanding mean on time series using pandas?

How would you create a column (or columns) in the pandas DataFrame below where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time series data where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
left4 = pd.DataFrame({'ID': [1, 2, 3, 4], 'val': [10000, 25000, 20000, 40000],
                      'Mod_ID': [15, 35, 15, 42], 'car': ['ford', 'honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3, 1, 2, 4], 'color': ['red', 'green', 'blue', 'grey'],
                       'wheel': ['4wheel', '4wheel', '2wheel', '2wheel'],
                       'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1[["Mod_ID_x","val"]].groupby("Mod_ID_x").transform(pd.expanding_mean)
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000
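Note that pd.expanding_mean has since been removed from pandas; on current versions the same per-group expanding statistics can be written with GroupBy.transform and Series.expanding (a sketch, reusing the df1 built above):
# Modern-pandas equivalent of the pd.expanding_mean call above.
df1["exp_mean"] = df1.groupby("Mod_ID_x")["val"].transform(lambda s: s.expanding().mean())
# For the expanding median asked about, swap mean() for median().
df1["exp_median"] = df1.groupby("Mod_ID_x")["val"].transform(lambda s: s.expanding().median())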