Getting my knickers in a twist trying to dynamically build a regex extraction pattern from an area list in a second dataframe, and to populate a new column with the string whenever the locality contains it.
here are the two starting tables:
import pandas as pd
import re
# this is a short extract, there are 1000's of records in this table:
provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
'area': ['Cerrado','Sul de Minas', 'Mococoa','Tolima','Huila','Quindio'],
'index': [13,21,19,35,36,34]})
# test dataframe
df_test = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['sul de minas minas gerais','chapadao cerrado','cerrado cerrado','mococoa sao paulo','pitalito huila','pijao quindio','espirito santo']})
print(provinces)
country area index
0 Brazil Cerrado 13
1 Brazil Sul de Minas 21
2 Brazil Mococoa 19
3 Colombia Tolima 35
4 Colombia Huila 36
5 Colombia Quindio 34
print(df_test)
country locality
0 brazil sul de minas minas gerais
1 brazil chapadao cerrado
2 brazil cerrado cerrado
3 brazil mococoa sao paulo
4 colombia pitalito huila
5 colombia pijao quindio
6 brazil espirito santo
and end result:
import numpy as np
df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
country locality area index
0 brazil minas gerais sul de minas 21.0
1 brazil chapadao cerrado 13.0
2 brazil cerrado cerrado 13.0
3 brazil sao paulo mococoa 19.0
4 colombia pitalito huila 36.0
5 colombia pijao quindio 34.0
6 brazil espirito santo NaN
I can't get past the first step: populating the area column. Once the area column contains a string, stripping that same string from the locality column and adding the index column with a left join on the country and area columns is the easy part(!)
# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(pat=r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list())), flags=re.IGNORECASE, expand=False)
and I'd also need to apply a map to exclude some records from this step.
# as above but for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country =='brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(pat=r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list())), flags=re.IGNORECASE, expand=False)
All the vectorised regex extract solutions I've found rely on pre-defined regex patterns, but here the patterns need to come from the provinces dataframe where the countries match. This question and answer seemed like the closest match to this scenario, but I couldn't make sense of it...
Thanks in advance
Following the trail of error messages (and some sleep!): "Can only compare identically-labeled Series objects" was resolved with this answer,
and then "ValueError: Lengths must match to compare" with this answer.
Here's the solution:
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(pat=r'({})'.format('|'.join(provinces.loc[provinces.country.str.lower().isin(df_test.country),'area'].str.lower().to_list())), flags=re.IGNORECASE, expand=False)
[out]
country locality area
0 brazil sul de minas minas gerais sul de minas
1 brazil chapadao cerrado cerrado
2 brazil cerrado cerrado cerrado
3 brazil mococoa sao paulo mococoa
4 colombia pitalito huila huila
5 colombia pijao quindio quindio
6 brazil espirito santo NaN
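For completeness, the remaining steps the question calls the easy part (stripping the matched area back out of locality, then a left join to pick up the index column) can be sketched like this. This is my own sketch, not code from the thread; it removes only the first occurrence of the match so 'cerrado cerrado' keeps one 'cerrado':

```python
import re
import numpy as np
import pandas as pd

provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
                          'area': ['Cerrado','Sul de Minas','Mococoa','Tolima','Huila','Quindio'],
                          'index': [13,21,19,35,36,34]})
df_test = pd.DataFrame({'country': ['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                        'locality': ['sul de minas minas gerais','chapadao cerrado','cerrado cerrado',
                                     'mococoa sao paulo','pitalito huila','pijao quindio','espirito santo']})

# one alternation pattern built from every area name
pat = r'\b({})\b'.format('|'.join(provinces['area'].str.lower()))
df_test['area'] = df_test['locality'].str.extract(pat, flags=re.IGNORECASE, expand=False)

# strip the first occurrence of the matched area from locality
df_test['locality'] = [loc if pd.isna(area) else loc.replace(area, '', 1).strip()
                       for loc, area in zip(df_test['locality'], df_test['area'])]

# left join on lower-cased country/area to pick up the index column
lookup = provinces.assign(country=provinces['country'].str.lower(),
                          area=provinces['area'].str.lower())
df_result = df_test.merge(lookup, on=['country', 'area'], how='left')
```

The unmatched 'espirito santo' row ends up with NaN in both area and index, which the merge leaves untouched.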
Related
I need some help with reshaping some data into groups. The variables are country1 and country2, and samegroup, which indicates if the countries are in the same group (continent). The original data I have is something like this:
country1    country2    samegroup
China       Vietnam     1
France      Italy       1
Brazil      Argentina   1
Argentina   Brazil      1
Australia   US          0
US          Australia   0
Vietnam     China       1
Vietnam     Thailand    1
Thailand    Vietnam     1
Italy       France      1
And I would like the output to be this:
country     group
China       1
Vietnam     1
Thailand    1
Italy       2
France      2
Brazil      3
Argentina   3
Australia   4
US          5
My first instinct would be to sort the initial data by "samegroup", then reshape (long to wide). But that doesn't quite solve the issue and I'm not sure how to continue from there. Any help would be greatly appreciated!
Unless you have a non-standard definition of continent, it is much easier to use kountry (which you will probably have to install) than reshape or repeated merges:
clear
input str12 country1 str12 country2 byte samegroup
China Vietnam 1
France Italy 1
Brazil Argentina 1
Argentina Brazil 1
Australia US 0
US Australia 0
Vietnam China 1
Vietnam Thailand 1
Thailand Vietnam 1
Italy France 1
end
capture net install dm0038_1
kountry country1, from(other) geo(marc) marker
rename (country1 GEO) (country group)
sort group country
capture ssc install sencode
sencode group, replace // or use recode here
keep country group
duplicates drop
list, clean noobs
label list group
This will produce
. list, clean noobs
country group
China Asia
Thailand Asia
Vietnam Asia
Australia Australasia
France Europe
Italy Europe
US North America
Argentina South America
Brazil South America
. label list group
group:
1 Asia
2 Australasia
3 Europe
4 North America
5 South America
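If the groups should ever come from the samegroup pairs themselves rather than a continent lookup, note that the pairs define a graph and the groups are its connected components. A minimal union-find sketch (in Python rather than Stata, purely for illustration; the pair data is taken from the question):

```python
# Pair data from the question; samegroup == 1 links two countries
pairs = [("China", "Vietnam", 1), ("France", "Italy", 1), ("Brazil", "Argentina", 1),
         ("Argentina", "Brazil", 1), ("Australia", "US", 0), ("US", "Australia", 0),
         ("Vietnam", "China", 1), ("Vietnam", "Thailand", 1), ("Thailand", "Vietnam", 1),
         ("Italy", "France", 1)]

parent = {}

def find(x):
    """Return the representative of x's group (union-find with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for c1, c2, same in pairs:
    find(c1); find(c2)          # register both countries
    if same:
        union(c1, c2)

# collect the connected components
groups = {}
for c in parent:
    groups.setdefault(find(c), set()).add(c)
```

This yields the same five groups as the desired output: {China, Vietnam, Thailand}, {France, Italy}, {Brazil, Argentina}, {Australia}, {US}.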
I have a dataset where I have to remove duplicate combinations.
These combinations are pairs of places, one each in two columns:
ID Place1 Place2
1 Ann Arbor Toledo
2 LA San Francisco
3 Chicago Peoria
4 Pittsburgh Cleveland
5 Richmond New Port
6 Ann Arbor Cincinnati
7 LA San Francisco
8 LA San Jose
9 Springfield Chicago
10 Richmond New Port
11 Atlanta Greenville
How can I get the output below?
ID Place1 Place2
1 Ann Arbor Toledo
2 LA San Francisco
3 Chicago Peoria
4 Pittsburgh Cleveland
5 Richmond New Port
6 Ann Arbor Cincinnati
7 LA San Jose
8 Springfield Chicago
9 Atlanta Greenville
The following works for me:
clear
input ID str20 Place1 str20 Place2
1 "Ann Arbor" "Toledo"
2 "LA" "San Francisco"
3 "Chicago" "Peoria"
4 "Pittsburgh" "Cleveland"
5 "Richmond" "New Port"
6 "Ann Arbor" "Cincinnati"
7 "LA" "San Francisco"
8 "LA" "San Jose"
9 "Springfield" "Chicago"
10 "Richmond" "New Port"
11 "Atlanta" "Greenville"
end
duplicates drop Place1 Place2, force
list, separator(0)
+----------------------------------+
| ID Place1 Place2 |
|----------------------------------|
1. | 1 Ann Arbor Toledo |
2. | 2 LA San Francisco |
3. | 3 Chicago Peoria |
4. | 4 Pittsburgh Cleveland |
5. | 5 Richmond New Port |
6. | 6 Ann Arbor Cincinnati |
7. | 8 LA San Jose |
8. | 9 Springfield Chicago |
9. | 11 Atlanta Greenville |
+----------------------------------+
Type help duplicates in Stata's command prompt for details and full syntax.
It is important to note that this will not work if you have pairs in your data like the one below:
LA San Francisco
San Francisco LA
See this article by @NickCox on how to deal with this case.
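The idea behind handling reversed pairs is to sort each pair into a canonical order before dropping duplicates. A sketch of that technique in pandas terms (hypothetical data, not from the question; the same logic carries over to Stata):

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "Place1": ["LA", "San Francisco", "Chicago"],
                   "Place2": ["San Francisco", "LA", "Peoria"]})

# canonicalise each pair so (LA, San Francisco) and (San Francisco, LA) share a key
key = df.apply(lambda r: tuple(sorted([r["Place1"], r["Place2"]])), axis=1)
deduped = df[~key.duplicated()]
```

Only the first row of each unordered pair survives, regardless of which column each place appears in.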
Given the data below, I want to print the list of teams that debuted between 1934 and 1948. Since the Debut column is of object dtype, I am not able to get the column data in integer form.
Team Debut
0 Real Madrid 1929
1 Barcelona 1929
2 Atletico Madrid 1929
3 Valencia 1931-32
4 Athletic Bilbao 1929
5 Sevilla 1934-35
6 Espanyol 1929
7 Real Sociedad 1929
8 Zaragoza 1939-40
9 Real Betis 1932-33
10 Deportivo La Coruna 1941-42
11 Celta Vigo 1939-40
12 Valladolid 1948-49
Can somebody please help to give an idea how to achieve it?
Thanks in advance
You can use str.extract to pull out the first part of the date and check whether it's in the required range:
mask = df['Debut'].str.extract(r'(\d+)')[0].astype(int).between(1934, 1948)
df[mask]
Team Debut
5 Sevilla 1934-35
8 Zaragoza 1939-40
10 Deportivo La Coruna 1941-42
11 Celta Vigo 1939-40
12 Valladolid 1948-49
If only the first year of the range counts, you could use between after converting to a numeric value:
year = pd.to_numeric(df.Debut.str.split('-').str[0])
teams = df.Team[year.between(1934, 1948)]
print(teams)
Output
5 Sevilla
8 Zaragoza
10 Deportivo La Coruna
11 Celta Vigo
12 Valladolid
Name: Team, dtype: object
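Putting the question's data and the extract-and-filter approach together as one self-contained, runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Real Madrid", "Barcelona", "Atletico Madrid", "Valencia",
             "Athletic Bilbao", "Sevilla", "Espanyol", "Real Sociedad",
             "Zaragoza", "Real Betis", "Deportivo La Coruna", "Celta Vigo",
             "Valladolid"],
    "Debut": ["1929", "1929", "1929", "1931-32", "1929", "1934-35", "1929",
              "1929", "1939-40", "1932-33", "1941-42", "1939-40", "1948-49"]})

# take the first four-digit year from each Debut value and filter on it
first_year = df["Debut"].str.extract(r"(\d{4})")[0].astype(int)
teams = df.loc[first_year.between(1934, 1948), "Team"]
```

This treats only the first year of a split season like "1948-49" as the debut year, matching both answers above.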
I have a dataframe as given below:
Index Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
I want to convert the daily data to weekly, grouped by country, with sum as the aggregation method.
I tried resampling, but the output was a MultiIndex data frame from which I was not able to access the "Country" and "Date" columns (please refer above).
The desired output is given below:
Date Country Occurence
Week1 India 4
Week2
Week1 US 2
Week2
Week5 UK 5
You can groupby on country and resample on week
In [63]: df
Out[63]:
Date Country Occurence
0 2013-12-30 US 1
1 2013-12-30 India 3
2 2014-01-10 US 1
3 2014-01-15 India 1
4 2014-02-05 UK 5
In [64]: df.set_index('Date').groupby('Country').resample('W').sum(min_count=1)
Out[64]:
Occurence
Country Date
India 2014-01-05 3
2014-01-12 NaN
2014-01-19 1
UK 2014-02-09 5
US 2014-01-05 1
2014-01-12 1
And, you could use reset_index()
In [65]: df.set_index('Date').groupby('Country').resample('W').sum(min_count=1).reset_index()
Out[65]:
Country Date Occurence
0 India 2014-01-05 3
1 India 2014-01-12 NaN
2 India 2014-01-19 1
3 UK 2014-02-09 5
4 US 2014-01-05 1
5 US 2014-01-12 1
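As a self-contained version of the same approach (min_count=1 is assumed here to keep weeks with no data as NaN rather than 0):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2013-12-30", "2013-12-30", "2014-01-10",
                            "2014-01-15", "2014-02-05"]),
    "Country": ["US", "India", "US", "India", "UK"],
    "Occurence": [1, 3, 1, 1, 5]})

# weekly sums per country; empty weeks inside each group come out as NaN
weekly = (df.set_index("Date")
            .groupby("Country")["Occurence"]
            .resample("W")
            .sum(min_count=1)
            .reset_index())
```

reset_index() flattens the (Country, Date) MultiIndex back into ordinary columns, which is exactly what the question asked for.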
I have one dataframe in the following form:
df = pd.read_csv('data/original.csv', sep = ',', names=["Date", "Gran", "Country", "Region", "Commodity", "Type", "Price"], header=0)
I'm trying to do a self join on the index Date, Gran, Country, Region producing rows in the form of
Date, Gran, Country, Region, CommodityX, TypeX, PriceX, CommodityY, TypeY, PriceY, CommodityZ, TypeZ, PriceZ
Every row should have all the different commodities and prices of a specific region.
Is there a simple way of doing this?
Any help is much appreciated!
Note: I simplified the example by ignoring a few attributes
Input Example:
Date Country Region Commodity Price
1 03/01/2014 India Vishakhapatnam Rice 25
2 03/01/2014 India Vishakhapatnam Tomato 30
3 03/01/2014 India Vishakhapatnam Oil 50
4 03/01/2014 India Delhi Wheat 10
5 03/01/2014 India Delhi Jowar 60
6 03/01/2014 India Delhi Bajra 10
Output Example:
Date Country Region Commodit1 Price1 Commodity2 Price2 Commodity3 Price3
1 03/01/2014 India Vishakhapatnam Rice 25 Tomato 30 Oil 50
2 03/01/2014 India Delhi Wheat 10 Jowar 60 Bajra 10
What you want to do is called a reshape (specifically, from long to wide). See this answer for more information.
Unfortunately as far as I can tell pandas doesn't have a simple way to do that. I adapted the answer in the other thread to your problem:
df['idx'] = df.groupby(['Date','Country','Region']).cumcount()
df.pivot(index= ['Date','Country','Region'], columns='idx')[['Commodity','Price']]
Does that solve your problem?
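Concretely, with the question's sample data it would look like this (column flattening added as a sketch; pivot with a list of index columns needs a reasonably recent pandas):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["03/01/2014"] * 6,
    "Country": ["India"] * 6,
    "Region": ["Vishakhapatnam"] * 3 + ["Delhi"] * 3,
    "Commodity": ["Rice", "Tomato", "Oil", "Wheat", "Jowar", "Bajra"],
    "Price": [25, 30, 50, 10, 60, 10]})

# number the commodities within each (Date, Country, Region) group
df["idx"] = df.groupby(["Date", "Country", "Region"]).cumcount() + 1

# pivot long to wide, then flatten the (column, idx) MultiIndex into Commodity1, Price1, ...
wide = df.pivot(index=["Date", "Country", "Region"], columns="idx")[["Commodity", "Price"]]
wide.columns = [f"{name}{i}" for name, i in wide.columns]
wide = wide.reset_index()
```

Each region collapses to a single row with Commodity1..3 and Price1..3 columns; regions with fewer commodities than the widest one would get NaN in the extra slots.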