Pandas DataFrame column getting implicitly converted to NaN after merge - python-2.7

Below are small samples of my 2 pandas dataframes:
In [65]: df1
Out[65]:
Send_Agent Send_Amount Country_Code
0 AWD120279 85.99 KW
1 API185805 22.98 PH
2 ANO080012 490.00 NO
3 AUK359401 616.16 GB
4 ACL000105 193.78 CL
In [44]: df2
Out[44]:
Country_Code Rating
0 KW Medium
1 PH Higher
2 NO Lower
3 GB Lower
4 CL Lower
In [97]: df4 = df1[df1['Send_Agent']=='AWD120279']
In [98]: df4
Out[98]:
Send_Agent Send_Amount Country_Code
0 AWD120279 85.99 KW
3359 AWD120279 200.00 KW
3878 AWD120279 203.03 KW
In [102]: df5 = df2[df2['Country_Code']=='KW']
In [104]: df5
Out[104]:
Country_Code Rating
15 KW Medium
In [105]: pd.merge(df4,df5,on='Country_Code',how='left')
Out[105]:
Send_Agent Send_Amount Country_Code Rating
0 AWD120279 85.99 KW NaN
1 AWD120279 200.00 KW NaN
2 AWD120279 203.03 KW NaN
I am not able to figure out why the 'Rating' column is converted to NaN after the merge. Every Country_Code has a rating associated with it, so it should never be NaN.
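The thread does not record what caused the NaN values here. One common cause, offered purely as an assumption, is that the key strings differ invisibly between the two frames, for example trailing whitespace in df1's Country_Code. The sketch below uses made-up data shaped like df4 and df5 (the trailing space is hypothetical). It uses the indicator=True option of pd.merge to show which rows failed to match, then strips the whitespace and merges again.

```python
import pandas as pd

# Made-up frames shaped like df4 and df5 in the question.
# ASSUMPTION: the trailing space in "KW " is hypothetical; the real cause is not given.
df4 = pd.DataFrame({
    "Send_Agent": ["AWD120279"] * 3,
    "Send_Amount": [85.99, 200.00, 203.03],
    "Country_Code": ["KW ", "KW ", "KW "],
})
df5 = pd.DataFrame({"Country_Code": ["KW"], "Rating": ["Medium"]}, index=[15])

# indicator=True adds a _merge column: "left_only" marks keys with no partner.
before = pd.merge(df4, df5, on="Country_Code", how="left", indicator=True)

# repr() makes invisible characters in the key visible.
raw_keys = [repr(code) for code in df4["Country_Code"].unique()]

# Normalise the key and merge again.
df4["Country_Code"] = df4["Country_Code"].str.strip()
after = pd.merge(df4, df5, on="Country_Code", how="left", indicator=True)

print(raw_keys)
print(before[["Country_Code", "Rating", "_merge"]])
print(after[["Country_Code", "Rating", "_merge"]])
```

If the _merge column already reports "both" on the real data, the keys do match and the cause lies elsewhere (for instance a differing dtype on the key column).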

Related

Create list from pandas dataframe

I have a function that takes every (non-distinct) MatchId together with its paired xG_Team1 and xG_Team2 values and returns an array of squared differences, which is then summed to give the sse.
The problem is that the function iterates over every row, so each MatchId is processed more than once. I want to stop this.
For each distinct MatchId I need the corresponding home and away goal times as lists, i.e. Home_Goal and Away_Goal, taken from the Home_Goal_Time and Away_Goal_Time columns of the dataframe, to be used in each iteration. The lists I build below don't seem to work.
MatchId Event_Id EventCode Team1 Team2 Team1_Goals
0 842079 2053 Goal Away Huachipato Cobresal 0
1 842079 2053 Goal Away Huachipato Cobresal 0
2 842080 1029 Goal Home Slovan lava 3
3 842080 1029 Goal Home Slovan lava 3
4 842080 2053 Goal Away Slovan lava 3
5 842080 1029 Goal Home Slovan lava 3
6 842634 2053 Goal Away Rosario Boca Juniors 0
7 842634 2053 Goal Away Rosario Boca Juniors 0
8 842634 2053 Goal Away Rosario Boca Juniors 0
9 842634 2054 Cancel Goal Away Rosario Boca Juniors 0
Team2_Goals xG_Team1 xG_Team2 CurrentPlaytime Home_Goal_Time Away_Goal_Time
0 2 1.79907 1.19893 2616183 0 87
1 2 1.79907 1.19893 3436780 0 115
2 1 1.70662 1.1995 3630545 121 0
3 1 1.70662 1.1995 4769519 159 0
4 1 1.70662 1.1995 5057143 0 169
5 1 1.70662 1.1995 5236213 175 0
6 2 0.82058 1.3465 2102264 0 70
7 2 0.82058 1.3465 4255871 0 142
8 2 0.82058 1.3465 5266652 0 176
9 2 0.82058 1.3465 5273611 0 0
For example, for MatchId = 842079: Home_Goal = [], Away_Goal = [87, 115]
import numpy as np

x1 = [1,0,0]
x2 = [0,1,0]
x3 = [0,0,1]
m = 1  # arbitrary constant used to optimise sse
k = 196
total_timeslot = 196
Home_Goal = [] # No Goal
Away_Goal = [] # No Goal
def sum_squared_diff(x1, x2, x3, y):
    ssd = []
    for k in range(total_timeslot): # k will take multiple values
        if k in Home_Goal:
            ssd.append(sum((x2 - y) ** 2))
        elif k in Away_Goal:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return ssd
def my_function(row):
    xG_Team1 = row.xG_Team1
    xG_Team2 = row.xG_Team2
    return np.array([1-(xG_Team1*m + xG_Team2*m)/k, xG_Team1*m/k, xG_Team2*m/k])
results = df.apply(lambda row: sum_squared_diff(x1, x2, x3, my_function(row)), axis=1)
results
sum(results.sum())
For the three matches above the desired outcome should look like the following.
If I need an individual sse, sum(sum_squared_diff(x1, x2, x3, y)) gives me the following.
MatchId = 842079 = 3.984053038520635
MatchId = 842080 = 7.882189570700502
MatchId = 842634 = 5.929085973050213
Given the size of the original data, realistically I am after the total sum of the sse. For the above sample data, simply adding up the values gives total sse = 17.79532858227135. Once I achieve this, I will try to optimise the sse based on this figure by updating the arbitrary value m.
Here are the lists I hoped the function would iterate over.
Home_scored = s.groupby('MatchId')['Home_Goal_Time'].apply(list)
Away_scored = s.groupby('MatchId')['Away_Goal_Time'].apply(list)
type(Home_scored)
pandas.core.series.Series
Then convert it to lists.
Home_Goal = Home_scored.tolist()
Away_Goal = Away_scored.tolist()
type(Home_Goal)
list
Home_Goal
Out[303]: [[0, 0], [121, 159, 0, 175], [0, 0, 0, 0]]
Away_Goal
Out[304]: [[87, 115], [0, 0, 169, 0], [70, 142, 176, 0]]
But the function still takes Home_Goal and Away_Goal as empty lists.
If you only want to consider one MatchId at a time, you should .groupby('MatchId') first:
df.groupby('MatchId').apply(...)
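A minimal sketch of that groupby idea, using the sample rows from the question. The helper name match_sse and the constant names are made up for illustration. Two assumptions are baked in: a goal time of 0 means "no goal on this row", and xG_Team1/xG_Team2 are constant within a match, so the first row of each group is used.

```python
import numpy as np
import pandas as pd

TOTAL_TIMESLOTS = 196
X_NO_GOAL = np.array([1.0, 0.0, 0.0])
X_HOME = np.array([0.0, 1.0, 0.0])
X_AWAY = np.array([0.0, 0.0, 1.0])

# Sample rows copied from the question (only the columns that are needed).
df = pd.DataFrame({
    "MatchId": [842079] * 2 + [842080] * 4 + [842634] * 4,
    "xG_Team1": [1.79907] * 2 + [1.70662] * 4 + [0.82058] * 4,
    "xG_Team2": [1.19893] * 2 + [1.1995] * 4 + [1.3465] * 4,
    "Home_Goal_Time": [0, 0, 121, 159, 0, 175, 0, 0, 0, 0],
    "Away_Goal_Time": [87, 115, 0, 0, 169, 0, 70, 142, 176, 0],
})

def match_sse(group, m=1.0, k=TOTAL_TIMESLOTS):
    """Sum of squared differences over all timeslots for ONE match."""
    xg1 = group["xG_Team1"].iloc[0]  # assumed constant within a match
    xg2 = group["xG_Team2"].iloc[0]
    y = np.array([1 - (xg1 + xg2) * m / k, xg1 * m / k, xg2 * m / k])
    # ASSUMPTION: a time of 0 means no goal, so zeros are dropped.
    home = set(group.loc[group["Home_Goal_Time"] > 0, "Home_Goal_Time"])
    away = set(group.loc[group["Away_Goal_Time"] > 0, "Away_Goal_Time"])
    total = 0.0
    for slot in range(k):
        if slot in home:
            target = X_HOME
        elif slot in away:
            target = X_AWAY
        else:
            target = X_NO_GOAL
        total += float(np.sum((target - y) ** 2))
    return total

# One SSE per distinct MatchId: each match is processed exactly once.
sse_per_match = {match_id: match_sse(g) for match_id, g in df.groupby("MatchId")}
total_sse = sum(sse_per_match.values())
print(sse_per_match)
print(total_sse)
```

Because the goal lists are built inside the function from the group it receives, nothing depends on the global Home_Goal and Away_Goal lists, which is why the original function always saw them as empty. The per-match values should come out close to the figures quoted in the question.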

Getting index values from pd mean() and std() functions

I'm trying to get the index values from a pd std().
My final objective is to match the index with another df and insert the corresponding values (standard deviations).
(in): df_std.index
(out): Index([u'AAPL US Equity', u'QQQ US Equity', u'BRABCBACNPR4 BZ Equity'...dtype='object')
However, I've been unable to add the index to the "right" of df_std because of the types: df_std.index is a series while df_std is a df. When I try, a row is added instead of a column:
(in): df_std['index'] = df_std.index
(out):
BRSTNCLF1R25 Govt 64.0864
BRITUBACNPR1 BZ Equity 2.67762
BRSTNCNTB4O9 Govt 48.2419
BRSTNCLF1R74 Govt 64.901
PBR US Equity 0.770755
BRBBASACNOR3 BZ Equity 2.93335
BRSTNCLF1R82 Govt 65.0979
index Index([u'AAPL US Equity', u'QQQ US Equity', u'...
dtype: object
I've already tried converting df_std.index to a tuple and to a dataframe.
Thanks!
Edit:
I'm trying to match df_std['index'] with df_final['bloomberg_ticker'] and bring the std values to df_final['std']:
(in): print df_final
(out):
serie tipo tp_cnpjfundo valor id bloomberg_ticker \
0 NaN caixa NaN NaN 0 NaN
1 NaN titpublicos NaN NaN 1 BRSTNCLF1R17 Govt
2 NaN titpublicos NaN NaN 2 BRSTNCLF1R17 Govt
3 NaN titpublicos NaN NaN 3 BRSTNCLF1R25 Govt
(the column 'id' will be deleted later)
Use .reset_index() rather than assigning, if what you have is a dataframe, i.e.
df_std = df_std.reset_index()
Example :
df = pd.DataFrame([0,1,2,3], index=['a','b','c','d'])
df = df.reset_index()
Output :
index 0
0 a 0
1 b 1
2 c 2
3 d 3
If what you have is a series, convert it to a dataframe first and then call reset_index, i.e. if df_std is the series, then
df_std = df_std.to_frame().reset_index()
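To make the series case concrete: calling .std() on a dataframe returns a Series indexed by the column names, which explains the extra row seen in the question. A small sketch with made-up prices (the tickers are from the question, the numbers are not) that also names the two resulting columns:

```python
import pandas as pd

# Made-up price data; only the ticker labels come from the question.
prices = pd.DataFrame({
    "AAPL US Equity": [1.0, 2.0, 3.0],
    "QQQ US Equity": [2.0, 4.0, 6.0],
})

df_std = prices.std()  # a Series: index = tickers, values = standard deviations

# Turn the index into a regular column and give both columns explicit names.
std_table = df_std.rename_axis("ticker").reset_index(name="std")
print(type(df_std).__name__)
print(std_table)
```

rename_axis and the name= argument of Series.reset_index are used here only to avoid the default column labels "index" and 0.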
I think what you are trying to do is map the values of a series onto a specific column, so you can use
df = pd.DataFrame({'col':['a','b','c','d','e'],'vales':[5,1,2,4,5]})
s = pd.Series([1,2,3],index=['a','b','c'])
df['new'] = df['col'].map(s)
Output :
col vales new
0 a 5 1.0
1 b 1 2.0
2 c 2 3.0
3 d 4 NaN
4 e 5 NaN
In your case you can use df_final['std'] = df_final['bloomberg_ticker'].map(df_std)
To check whether the values in a dataframe column are present in the index of the series, you can use .isin, i.e.
df['col'].isin(s.index) # Returns the boolean mask
df[df['col'].isin(s.index)] # Returns the rows of the dataframe whose 'col' value is in the index
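Applied to the frames in the question, the mapping might look like the sketch below. It assumes df_std is a Series indexed by ticker. The std values and df_final rows are a subset of what the question prints; only the columns needed for the example are kept.

```python
import numpy as np
import pandas as pd

# A few of the standard deviations shown in the question, as a Series.
df_std = pd.Series({
    "BRSTNCLF1R25 Govt": 64.0864,
    "BRITUBACNPR1 BZ Equity": 2.67762,
    "PBR US Equity": 0.770755,
})

# Cut-down version of df_final from the question.
df_final = pd.DataFrame({
    "tipo": ["caixa", "titpublicos", "titpublicos", "titpublicos"],
    "id": [0, 1, 2, 3],
    "bloomberg_ticker": [np.nan, "BRSTNCLF1R17 Govt",
                         "BRSTNCLF1R17 Govt", "BRSTNCLF1R25 Govt"],
})

# Look each ticker up in the Series index; tickers without a match become NaN.
df_final["std"] = df_final["bloomberg_ticker"].map(df_std)

# Boolean mask of rows whose ticker has a standard deviation available.
has_std = df_final["bloomberg_ticker"].isin(df_std.index)
print(df_final)
print(has_std)
```

No reset_index is needed for this route, since map aligns on the Series index directly.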

R - Inserting variable number of spaces into postcode string

I have a set of UK postcodes which need to be reformatted. They are made up of an incode and an outcode, where incode is of the form 'number letter letter' e.g. 2DB and the outcode is a combination of between 2 and 4 letters and numbers e.g. NW1 or SW10 or EC1A
Currently there is one space between the incode and outcode, but I need to reformat these so that the full postcode is 7 characters long e.g: ('-' stands for space)
NW1-2DB -> NW1-2DB (1 space between outcode and incode)
SW10-9NH -> SW109NH (0 spaces)
E1-6QL -> E1--6QL (2 spaces)
Data:
df <- data.frame("postcode"=c("NW1 2DB","SW10 9NH","E1 6QL"))
df
# postcode
# 1 NW1 2DB
# 2 SW10 9NH
# 3 E1 6QL
I have written a regex string to separate the outcode and incode, but couldn't find a way to add a variable number of spaces between them (this example just creates two spaces between outcode and incode).
require(dplyr)
df <- df %>% mutate(postcode_2sp = gsub('?(\\S+)\\s*?(\\d\\w{2})$','\\1  \\2', postcode))
To get around that I've tried to use mutate(),nchar() and rep():
df<-df %>%
  mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
         incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
  mutate(out_length=nchar(outcode))%>%
  mutate(postcode7=paste0(outcode,
                          paste0(rep(" ",4-out_length),collapse=""),
                          incode))
but get this error:
Error: invalid 'times' argument
Without the last step to create postcode7, the df looks as follows:
df
#   postcode outcode incode out_length
# 1  NW1 2DB     NW1    2DB          3
# 2 SW10 9NH    SW10    9NH          4
# 3   E1 6QL      E1    6QL          2
And if I set the rep 'times' argument to a constant the code runs as expected (but doesn't do what I need it to do!)
df<-df %>%
  mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
         incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
  mutate(out_length=nchar(outcode))%>%
  mutate(postcode7=paste0(outcode,
                          paste0(rep(" ",4),collapse=""),
                          incode))
df
#   postcode outcode incode out_length   postcode7
# 1  NW1 2DB     NW1    2DB          3  NW1    2DB
# 2 SW10 9NH    SW10    9NH          4 SW10    9NH
# 3   E1 6QL      E1    6QL          2   E1    6QL
Is there a way to make rep() accept a column as the times argument in a mutate? Or should I be looking at a totally different approach?
EDIT: I've just realised that I can use an if statement for each case of 2 characters, 3 characters or 4 characters in the outcode but that doesn't feel very elegant.
Have a look at the str_pad method from stringr package, which is suited for your case:
library(stringr)
df<-df %>%
  mutate(outcode=gsub('?(\\S+)\\s*\\d\\w{2}$','\\1',postcode),
         incode=gsub('\\S+\\s*?(\\d\\w{2})$','\\1',postcode)) %>%
  mutate(out_length=nchar(outcode)) %>%
  mutate(postcode7 = paste(outcode, str_pad(incode, 7-out_length), sep = ""))
df
#   postcode outcode incode out_length postcode7
# 1  NW1 2DB     NW1    2DB          3   NW1 2DB
# 2 SW10 9NH    SW10    9NH          4   SW109NH
# 3   E1 6QL      E1    6QL          2   E1  6QL
Another solution, using sprintf to format the output, and tidyr::extract for matching. This has the advantage of drastically simplifying both the pattern and the code for padding:
df %>%
extract(postcode, into = c('out', 'in'), '(\\S{2,4})\\s*(\\d\\w\\w)') %>%
mutate(postcode = sprintf('% -4s%s', out, `in`))
I do like the separate version posted in another answer, but it requires that the postcodes are all separated by whitespace. In my experience this generally isn’t the case.
Using str_pad and separate:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate(postcode, into = c("outcode", "incode"), remove = FALSE) %>%
  mutate(
    postcode8 = paste0(outcode,
                       str_pad(incode,
                               8 - nchar(outcode), side = "left", pad = " ")))
#   postcode outcode incode postcode8
# 1  NW1 2DB     NW1    2DB  NW1  2DB
# 2 SW10 9NH    SW10    9NH  SW10 9NH
# 3   E1 6QL      E1    6QL  E1   6QL
df%>%mutate(Postcode7=paste0(format(gsub('\\s.*$','',postcode),justify='left'),
format(gsub('^\\S+\\s','',postcode),justify='right')))

R - How to document the number of grepl matches based on another data frame?

This is a rather tricky question indeed. It would be awesome if someone might be able to help me out.
What I'm trying to do is the following. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows). Let's call it NewHampshire.df:
Municipality County Population
1 Acworth Sullivan 891
2 Albany Carroll 735
3 Alexandria Grafton 1613
4 Allenstown Merrimack 4322
5 Alstead Cheshire 1937
6 Alton Belknap 5250
7 Amherst Hillsborough 11201
8 Andover Merrimack 2371
9 Antrim Hillsborough 2637
10 Ashland Grafton 2076
I've further compiled a new variable called grep_term, which combines the values from Municipality and County into a single pattern that functions as an or-statement, something like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
I want to do two things:
1: Grepl through every observation in the location.df dataset, and save a TRUE or FALSE into a new variable depending on whether the self-disclosed location is part of the list in the first dataset.
2: Save the number of matches for a particular line in the NewHampshire.df dataset to a new variable. I.e., if there are 4 matches for Acworth in the twitter location dataset, there should be a value "4" for observation 1 in the NewHampshire.df on the newly created "matches" variable
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
sep = "|"),
collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2

pandas: create a left join and add rows

I would like to join two DataFrames together:
left = DataFrame({'Title': ['Paris Match', 'Lancome', 'Channel'],
'City': ['Paris', 'Milan', 'Montpellier']})
right = DataFrame({'Title': ['Lulu', 'Channel', 'Balance', 'Paris Match', 'Shaq', 'And 1'],
'City': ['New york', 'Valparaiso' ,'Montreal', 'Paris', 'Los Angeles', 'Brooklyn'],
'Price':[10,20,30,40,50,60]})
and the result expected is:
r = DataFrame({'Title': ['Paris Match', 'Lancome', 'Channel','Lulu', 'Balance', 'Shaq', 'And 1'],
'City': ['Paris', 'Milan', 'Montpellier', 'Montreal', 'Paris', 'Los Angeles', 'Brooklyn'],
'Price':[40,'NaN',30,40,50,60,'Nan']})
r[['Title', 'City', 'Price']]
I'm doing result = left.join(right) and I'm getting a columns overlap error on Title and City.
Perform an outer merge:
In [30]:
left.merge(right, how='outer')
Out[30]:
City Title Price
0 Paris Paris Match 40
1 Milan Lancome NaN
2 Montpellier Channel NaN
3 New york Lulu 10
4 Valparaiso Channel 20
5 Montreal Balance 30
6 Los Angeles Shaq 50
7 Brooklyn And 1 60
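For completeness, a self-contained version of the above. DataFrame.join aligns on the index and raises a ValueError when column names overlap and no suffix is given, which is the error in the question. merge with no on= argument joins on every column the two frames share (here Title and City). The indicator=True flag is an addition for illustration, to show where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"Title": ["Paris Match", "Lancome", "Channel"],
                     "City": ["Paris", "Milan", "Montpellier"]})
right = pd.DataFrame({"Title": ["Lulu", "Channel", "Balance", "Paris Match", "Shaq", "And 1"],
                      "City": ["New york", "Valparaiso", "Montreal", "Paris", "Los Angeles", "Brooklyn"],
                      "Price": [10, 20, 30, 40, 50, 60]})

# join() works on the index, so the shared column names clash.
try:
    left.join(right)
    join_error = None
except ValueError as exc:
    join_error = str(exc)

# Outer merge on all shared columns (Title and City), keeping rows from both sides.
merged = left.merge(right, how="outer", indicator=True)

# Rows present only in `left` have no Price, hence NaN.
left_only_titles = sorted(merged.loc[merged["_merge"] == "left_only", "Title"])
print(join_error)
print(merged)
print(left_only_titles)
```

Note that 'Channel' appears twice in the result because its City differs between the two frames; merging with on='Title' alone would pair those rows instead, at the cost of getting City_x and City_y columns.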