Geocoding Geopandas - geocoding

Hey guys, I'm running into an issue with geocoding. I have an address.csv with hundreds of addresses. I am using geopandas.tools.geocode with Bing as the provider to get ZIP codes for all these addresses, but I am unsure how to extract just the ZIP code.
This is my current output.
geometry address
0 POINT (-112.13369 33.84443) 39508 N Daisy Mountain Dr, Anthem, AZ 85086, U...
1 POINT (-112.13671 33.86698) 3640 W Anthem Way, Anthem, AZ 85086, United St...
I just want to extract the zipcode from the address field, and add it into my address.csv file.

One way of doing this is the following:
import pandas as pd
df = pd.read_csv("adresses.csv", sep=";")
print(df)
geometry \
0 POINT (-112.13369 33.84443)
1 POINT (-112.13671 33.86698)
address
0 39508 N Daisy Mountain Dr, Anthem, AZ 85086, U...
1 3640 W Anthem Way, Anthem, AZ 85086, United St...
and split the column you want by delimiters:
adresses = df['address']
df[['street', 'town', 'zip', 'other']] = adresses.str.split(", ", n=3, expand=True)
df
which returns:
geometry \
0 POINT (-112.13369 33.84443)
1 POINT (-112.13671 33.86698)
address \
0 39508 N Daisy Mountain Dr, Anthem, AZ 85086, U...
1 3640 W Anthem Way, Anthem, AZ 85086, United St...
street town zip other
0 39508 N Daisy Mountain Dr Anthem AZ 85086 United States
1 3640 W Anthem Way Anthem AZ 85086 United States
I don't really know how ZIP codes work, but if you do not want the AZ (Arizona), you can split the zip column again, this time on whitespace:
df[['State', 'code']] = df['zip'].str.split(expand=True)
Which gives:
geometry \
0 POINT (-112.13369 33.84443)
1 POINT (-112.13671 33.86698)
address \
0 39508 N Daisy Mountain Dr, Anthem, AZ 85086, U...
1 3640 W Anthem Way, Anthem, AZ 85086, United St...
street town zip other State code
0 39508 N Daisy Mountain Dr Anthem AZ 85086 United States AZ 85086
1 3640 W Anthem Way Anthem AZ 85086 United States AZ 85086
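Splitting on commas only works when every address has the same number of comma-separated parts. As an alternative sketch, a regular expression can pull the ZIP code straight out of the address column. The sample rows and the `zipcode` column name below are assumptions for illustration, and the pattern assumes US-style addresses with a two-letter state abbreviation directly before a 5-digit ZIP (optionally ZIP+4).

```python
import pandas as pd

# Sample rows mirroring the geocoder output shown in the question.
geocoded = pd.DataFrame({
    "address": [
        "39508 N Daisy Mountain Dr, Anthem, AZ 85086, United States",
        "3640 W Anthem Way, Anthem, AZ 85086, United States",
    ]
})

# Capture the digits that follow a two-letter state abbreviation.
# Rows that do not match the pattern get NaN instead of raising an error.
geocoded["zipcode"] = geocoded["address"].str.extract(
    r"\b[A-Z]{2} (\d{5}(?:-\d{4})?)\b", expand=False
)

print(geocoded[["address", "zipcode"]])
```

The new column can then be written back with `DataFrame.to_csv`, for example `geocoded.to_csv("address_with_zip.csv", index=False)`, where the file name is hypothetical.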

How do I reshape data by groups? (Stata)

I need some help with reshaping some data into groups. The variables are country1 and country2, and samegroup, which indicates if the countries are in the same group (continent). The original data I have is something like this:
country1   country2   samegroup
China      Vietnam    1
France     Italy      1
Brazil     Argentina  1
Argentina  Brazil     1
Australia  US         0
US         Australia  0
Vietnam    China      1
Vietnam    Thailand   1
Thailand   Vietnam    1
Italy      France     1
And I would like the output to be this:
country    group
China      1
Vietnam    1
Thailand   1
Italy      2
France     2
Brazil     3
Argentina  3
Australia  4
US         5
My first instinct would be to sort the initial data by "samegroup", then reshape (long to wide). But that doesn't quite solve the issue and I'm not sure how to continue from there. Any help would be greatly appreciated!
Unless you have a non-standard definition of continent, it is much easier to use kountry (which you will probably have to install) than reshape or repeated merges:
clear
input str12 country1 str12 country2 byte samegroup
China Vietnam 1
France Italy 1
Brazil Argentina 1
Argentina Brazil 1
Australia US 0
US Australia 0
Vietnam China 1
Vietnam Thailand 1
Thailand Vietnam 1
Italy France 1
end
capture net install dm0038_1
kountry country1, from(other) geo(marc) marker
rename (country1 GEO) (country group)
sort group country
capture ssc install sencode
sencode group, replace // or use recode here
keep country group
duplicates drop
list, clean noobs
label list group
This will produce
. list, clean noobs
country group
China Asia
Thailand Asia
Vietnam Asia
Australia Australasia
France Europe
Italy Europe
US North America
Argentina South America
Brazil South America
. label list group
group:
1 Asia
2 Australasia
3 Europe
4 North America
5 South America
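If the groups have to be derived from the samegroup flags themselves rather than from a continent lookup, the problem amounts to finding connected components among the country pairs. Below is a minimal sketch of that idea in Python (not Stata), using a small union-find structure. The helper names `find` and `union` are illustrative, and treating countries that only appear with samegroup 0 as singleton groups is an assumption taken from the desired output in the question.

```python
# Pairwise data from the question: (country1, country2, samegroup).
pairs = [
    ("China", "Vietnam", 1),
    ("France", "Italy", 1),
    ("Brazil", "Argentina", 1),
    ("Argentina", "Brazil", 1),
    ("Australia", "US", 0),
    ("US", "Australia", 0),
    ("Vietnam", "China", 1),
    ("Vietnam", "Thailand", 1),
    ("Thailand", "Vietnam", 1),
    ("Italy", "France", 1),
]

parent = {}  # maps each country to its parent in the union-find forest


def find(country):
    """Return the representative of the set containing `country`."""
    parent.setdefault(country, country)
    while parent[country] != country:
        parent[country] = parent[parent[country]]  # path halving
        country = parent[country]
    return country


def union(a, b):
    """Merge the sets containing `a` and `b`."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


for country1, country2, samegroup in pairs:
    find(country1)  # register both countries, even when samegroup is 0
    find(country2)
    if samegroup == 1:
        union(country1, country2)

# Number the groups in order of first appearance in the data.
group_of_root = {}
groups = {}
for country in list(parent):
    root = find(country)
    group_of_root.setdefault(root, len(group_of_root) + 1)
    groups[country] = group_of_root[root]

for country, group in groups.items():
    print(country, group)
```

With this numbering scheme, China, Vietnam and Thailand share group 1 because Thailand is linked to Vietnam, which is linked to China, even though no row pairs China with Thailand directly.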

How to match all instances of all characters in multi-line text between (and not including) two strings [duplicate]

This is my multi-line text:
Route Segment: 1
10601 Derecho Dr, Austin, TX 78737, USA to Zilker Nature Preserve, 301 Nature Center Dr, Austin, TX 78746, USA
12.8 mi
Route Segment: 2
Zilker Nature Preserve, 301 Nature Center Dr, Austin, TX 78746, USA to Roy and Ann Butler Hike and Bike Trail, 900 W Riverside Dr, Austin, TX 78704, USA
2.0 mi
Route Segment: 3
Roy and Ann Butler Hike and Bike Trail, 900 W Riverside Dr, Austin, TX 78704, USA to East Austin, Austin, TX, USA
4.4 mi
I would like to use regex to extract the first address of each segment from the text:
10601 Derecho Dr, Austin, TX 78737, USA
Zilker Nature Preserve, 301 Nature Center Dr, Austin, TX 78746, USA
Roy and Ann Butler Hike and Bike Trail, 900 W Riverside Dr, Austin, TX 78704, USA
This is the regex I am using:
(?:Route Segment: \d)([\S\s]*?)to
It captures only the first instance instead of all instances, and the match includes Route Segment: \d and to. I want to capture only what is in between, excluding those strings.
I am using the JavaScript regex engine (regexr.com).
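The two symptoms have separate causes: only one result appears when the search is not global, and the delimiters show up because the full match is read instead of the capture group. A minimal sketch of the fix, written in Python rather than JavaScript, is shown below. `re.findall` returns every occurrence of the single capture group; in JavaScript the same idea would need the `g` flag and reading group 1 of each match. The pattern assumes the first address never contains the substring " to ".

```python
import re

text = "\n".join([
    "Route Segment: 1",
    "10601 Derecho Dr, Austin, TX 78737, USA to Zilker Nature Preserve, 301 Nature Center Dr, Austin, TX 78746, USA",
    "12.8 mi",
    "Route Segment: 2",
    "Zilker Nature Preserve, 301 Nature Center Dr, Austin, TX 78746, USA to Roy and Ann Butler Hike and Bike Trail, 900 W Riverside Dr, Austin, TX 78704, USA",
    "2.0 mi",
    "Route Segment: 3",
    "Roy and Ann Butler Hike and Bike Trail, 900 W Riverside Dr, Austin, TX 78704, USA to East Austin, Austin, TX, USA",
    "4.4 mi",
])

# The header and the " to " separator sit outside the capture group,
# so findall returns only the text in between.
pattern = re.compile(r"Route Segment: \d+\s+(.+?) to ")
first_addresses = pattern.findall(text)

for address in first_addresses:
    print(address)
```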

Pandas dynamically pattern match from second dataframe and extract string

Getting my knickers in a twist trying to dynamically build a regex extract pattern from a second dataframe list and populate another column with the string if it's contained in the list.
Here are the two starting tables:
import numpy as np
import pandas as pd
import re
# this is a short extract, there are 1000's of records in this table:
provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
'area': ['Cerrado','Sul de Minas', 'Mococoa','Tolima','Huila','Quindio'],
'index': [13,21,19,35,36,34]})
# test dataframe
df_test = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['sul de minas minas gerais','chapadao cerrado','cerrado cerrado','mococa sao paulo','pitalito huila','pijao quindio','espirito santo']})
print(provinces)
country area index
0 Brazil Cerrado 13
1 Brazil Sul de Minas 21
2 Brazil Mococoa 19
3 Colombia Tolima 35
4 Colombia Huila 36
5 Colombia Quindio 34
print(df_test)
country locality
0 brazil sul de minas minas gerais
1 brazil chapadao cerrado
2 brazil cerrado cerrado
3 brazil mococa sao paulo
4 colombia pitalito huila
5 colombia pijao quindio
6 brazil espirito santo
and the desired end result:
df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
country locality area index
0 brazil minas gerais sul de minas 21.0
1 brazil chapadao cerrado 13.0
2 brazil cerrado cerrado 13.0
3 brazil sao paulo mococoa 19.0
4 colombia pitalito huila 36.0
5 colombia pijao quindio 34.0
6 brazil espirito santo NaN
Can't get around the first step to populate the area column. Once the area column contains a string, stripping the same string from the locality column and adding the index column with a left join on the country and area columns is the easy part(!)
# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))
and I'd also need to apply a map to exclude some records from this step.
# as above but for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country =='brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))
All the vectorised regex extract solutions I've found rely on pre-defined regex patterns, but here the patterns need to come from the provinces dataframe where the countries match. This question and answer seemed like the closest match to this scenario, but I couldn't make sense of it...
Thanks in advance
Following the trail of error messages (and sleep!), "Can only compare identically-labeled Series objects" was resolved with this answer,
and then "ValueError: Lengths must match to compare" with this answer.
Here's the solution:
df_test['area'] = ''
# expand=False is an argument of str.extract, so it must sit outside the .format() call
pattern = r'({})'.format('|'.join(provinces.loc[provinces.country.str.lower().isin(df_test.country), 'area'].str.lower().to_list()))
df_test.area = df_test.locality.str.extract(pattern, flags=re.IGNORECASE, expand=False)
[out]
country locality area
0 brazil sul de minas minas gerais sul de minas
1 brazil chapadao cerrado cerrado
2 brazil cerrado cerrado cerrado
3 brazil mococoa sao paulo mococoa
4 colombia pitalito huila huila
5 colombia pijao quindio quindio
6 brazil espirito santo NaN
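Note that the `isin` filter builds one pattern from every area whose country appears anywhere in df_test, so a Colombian area name could still match a Brazilian row. Below is a minimal sketch of one way to restrict the match to the row's own country: build one pattern per country and apply each to the matching rows only. The `patterns` dictionary and the loop are illustrative assumptions rather than part of the original solution, and the data is the short extract from the question.

```python
import re

import pandas as pd

provinces = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Brazil", "Colombia", "Colombia", "Colombia"],
    "area": ["Cerrado", "Sul de Minas", "Mococoa", "Tolima", "Huila", "Quindio"],
    "index": [13, 21, 19, 35, 36, 34],
})
df_test = pd.DataFrame({
    "country": ["brazil", "brazil", "brazil", "brazil", "colombia", "colombia", "brazil"],
    "locality": ["sul de minas minas gerais", "chapadao cerrado", "cerrado cerrado",
                 "mococa sao paulo", "pitalito huila", "pijao quindio", "espirito santo"],
})

# One alternation pattern per (lower-cased) country. Longer names come first
# so that a multi-word area wins over any shorter name it might contain.
patterns = {
    country.lower(): r"\b("
    + "|".join(re.escape(a) for a in sorted(areas.str.lower(), key=len, reverse=True))
    + r")\b"
    for country, areas in provinces.groupby("country")["area"]
}

df_test["area"] = None
for country, pattern in patterns.items():
    rows = df_test["country"] == country
    df_test.loc[rows, "area"] = df_test.loc[rows, "locality"].str.extract(
        pattern, flags=re.IGNORECASE, expand=False
    )

print(df_test)
```

Row 3 stays empty in this sketch because the locality is spelled "mococa" while the provinces table has "Mococoa"; exact-name matching cannot bridge that difference.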

Regex Hero results differ from vb.net

Hi I am using Regex Hero to make a Regex. It worked as expected in Regex Hero. I then transferred it over to vb.net and now I get different results from the same exact data. I don't get it!
The Regex:
\d{10,13}.+?(?=(\bF\b|\bT\b|\bCT\b))
The .net code:
Dim strRegex As String = "\d{10,13}.+?(?=(\bF\b|\bT\b|\bCT\b))"
Dim myRegex As New Regex(strRegex, RegexOptions.None)
Dim strTargetString As String = a
For Each myMatch As Match In myRegex.Matches(strTargetString)
    If myMatch.Success Then
        RichTextBox1.AppendText(myMatch.Value & Environment.NewLine)
    End If
Next
The Data:
" The Meijer Team appreciates your business 12/26/14 Your fast and friendly checkout was provided by ALARIA MEIJER SAVINGS SPECIALS 4.77 COUPONS 20.00 SAVINGS TOTAL 24.77 YOUR TOTAL SAVINGS SINCE 01/01/14 1,814.62 For additional savings and rewards visit mPerks.com GENERAL MERCHANDISE 7569107330 DYNO GEL THIMBL 1.49 CT 7569100487 DYNO THIMBLES 3.49 CT DRUGSTORE 2220094152 DEODORANT 1.99 T 30041667803 TOOTHBRUSH 9.99 T *70882049496 ORBIT TOOTHB was 3.69 now 2.95 T *1700006806 ANTIPERSPIRANT 1 # 2 / 6.00 was 3.99 now 3.00 T GROCERY 6414404213 CHEF BOYARDEE 2 # 1.07 2.14 F 6414404306 CHEF BOYARDEE 1.07 F 6414404315 CHEF BOYARDEE 1.07 F 6414404322 CHEF BOYARDEE 1.07 F 7680828008 SPAGHETTI 2 # 1.34 2.68 F 5100002549 PASTA SAUCE 2 # 1.97 3.94 F 4335400750 TORTILLAS 1.99 F 4400000057 SALTINES 2.69 F 4400002854 NABISCO OREOS 2 # 2.98 5.96 F 1312000484 FROZEN FRIES 2.99 F 4125002562 CHEESE SLICES 2.99 F 4125010210 MEIJER MILK 2 # 3.09 6.18 F 5150092751 PANCAKE SYRUP 3.19 F 3000032188 OATMEAL 3.29 F 3000032189 OATMEAL 3.29 F 8390000649 GOLD PEAK TEA 3.29 F 71373336283 MEIJER MILK 3.29 F 4400002734 COOKIES 3.49 F 1410007083 PEPPERDIGE FARM 2 # 3.99 7.98 F 1600027297 CEREAL 4.19 F 1600043471 CEREAL 4.19 F 4850001833 ORANGE JUICE 5.69 F 1450001420 BIRDS EYE VOILA 7.39 F *4400003113 SNACK CRACKER was 2.77 now 1.99 F *5100002526 SPAGHETTIOS 3 # 5 / 5.00 was 3.27 now 3.00 F *7192176312 FROZEN PIZZA 2 # 5.49 was 12.58 now 10.98 F *7131400331 AUNT MILLIE"S 1 # 2 / 6.00 was 3.39 now 3.00 F Total Basket Coupon => 20.00 off -20.00 Mperks # -- ********** TOTAL MI 6% Sales Tax 1.16 TOTAL TAX 1.16 TOTAL 107.09 PAYMENTS Primary Account - Debit ATM/DEBIT CARD TENDER 107.09 XXXXXXXXXXXXXXXX NUMBER OF ITEMS 42 See meijer.com or the Service Desk for current return policy. For additional savings and rewards visit mPerks.com. Tx:XXX Op:XXXXXX Tm:XX St:XX XXXXXXXXXXX How are we doing? Rate your shopping experience and you may win $1000 in Meijer gift cards! 
Visit us at www.meijer.com/tellmeijer or call 1-800-394-7198 Secure Code: 7800-0601-5020-3373-001 Survey should be completed within 72 hrs "
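One way to narrow down such a discrepancy is to run the same pattern in a third engine on a short excerpt of the data and compare the matches one by one. The sketch below does this in Python. It does not explain the difference between Regex Hero and VB.NET; it only shows what this pattern yields on the excerpt under Python's `re` module, and the excerpt is a shortened copy of the receipt text.

```python
import re

pattern = re.compile(r"\d{10,13}.+?(?=(\bF\b|\bT\b|\bCT\b))")

# Shortened excerpt of the receipt text from the question.
excerpt = (
    "GENERAL MERCHANDISE 7569107330 DYNO GEL THIMBL 1.49 CT "
    "7569100487 DYNO THIMBLES 3.49 CT DRUGSTORE "
    "2220094152 DEODORANT 1.99 T 30041667803 TOOTHBRUSH 9.99 T"
)

# finditer is used instead of findall because the pattern contains a
# capture group, and findall would return that group rather than the match.
matches = [m.group(0).strip() for m in pattern.finditer(excerpt)]

for item in matches:
    print(item)
```

Each match runs from the item number up to, but not including, the standalone F, T or CT tax code, because the lookahead does not consume it.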

GNU Sed format USA address to Street, City, State, Zip

I have the following data for some of my customers:
719 13th Street East, Glencoe MN, 55336
626 Valley Road, Montclair NJ, 07043
666 EAST DYER ROAD, SANTA ANA CA, 92705
20800 N. 135th Ave, Sun City West AZ, 85375
9775 Herring Gull Drive, Indianapolis IN, 46280
712 21st Street, Vero Beach FL, 32960
PO BOX 324, PORT SALERNO FL, 34992
207 Middleton Road, Lafayette LA, 70503
5091 nw fiddle leaf ct, port saint lucie FL, 34986
347 Mayberry Lane, Dover DE, 19904
2648 SW 137th Ave, Miramar FL, 33027
4410 Williams Dr SUITE 104, Georgetown TX, 78628
17020 Windsor Court, Homer Glen IL, 60491
11 Technology Drive North, Warren NJ, 07059
655 Boylston St, Boston MA, 02116
1375 bishops terrace, wixom MI, 48393
4705 Center Blvd Apt. 808, Long Island City NY, 11109
5340 CORNELIA HWY, ALTO GA, 30510
1541 Paces Ferry North, Smyrna GA, 30080
603 west pacific coast hwy, wilmington CA, 90744
2503Paddock CT, Louisville KY, 40216
9421 Dunbar dr, Oakland CA, 94603
1804 Third Avenue Apt #8, New York NY, 10029
2504 bellaire st, wantagh NY, 11793
1380 avon lane apt 21, north lauderdale FL, 33068
How can I use a sed regex to format it like this:
Street Address|City|State|Zip
e.g.
719 13th Street East|Glencoe|MN|55336
626 Valley Road|Montclair|NJ|07043
666 EAST DYER ROAD|SANTA ANA|CA|92705
Thanks!
sed 's/^\(.*\), *\(.*\) \(..\), \([0-9][0-9][0-9][0-9][0-9]\)/\1|\2|\3|\4/'
or:
sed -r 's/^(.*), *(.*) (..), ([0-9]{5})/\1|\2|\3|\4/'
Output:
719 13th Street East|Glencoe|MN|55336
626 Valley Road|Montclair|NJ|07043
666 EAST DYER ROAD|SANTA ANA|CA|92705
20800 N. 135th Ave|Sun City West|AZ|85375
9775 Herring Gull Drive|Indianapolis|IN|46280
712 21st Street|Vero Beach|FL|32960
PO BOX 324|PORT SALERNO|FL|34992
207 Middleton Road|Lafayette|LA|70503
5091 nw fiddle leaf ct|port saint lucie|FL|34986
347 Mayberry Lane|Dover|DE|19904
2648 SW 137th Ave|Miramar|FL|33027
4410 Williams Dr SUITE 104|Georgetown|TX|78628
17020 Windsor Court|Homer Glen|IL|60491
11 Technology Drive North|Warren|NJ|07059
655 Boylston St|Boston|MA|02116
1375 bishops terrace|wixom|MI|48393
4705 Center Blvd Apt. 808|Long Island City|NY|11109
5340 CORNELIA HWY|ALTO|GA|30510
1541 Paces Ferry North|Smyrna|GA|30080
603 west pacific coast hwy|wilmington|CA|90744
2503Paddock CT|Louisville|KY|40216
9421 Dunbar dr|Oakland|CA|94603
1804 Third Avenue Apt #8|New York|NY|10029
2504 bellaire st|wantagh|NY|11793
1380 avon lane apt 21|north lauderdale |FL|33068
Try with this:
sed -e 's/\([A-Z]*\) \([A-Z][A-Z]\),/\1\|\2,/g' -e 's/, /\|/g'
The second expression replaces every ", " with |. Before that, the first expression looks for the pattern AAAA AA, and changes it to AAAA|AA, for the City|State part.
Test
$ sed -e 's/\([A-Z]*\) \([A-Z][A-Z]\),/\1\|\2,/g' -e 's/, /\|/g' your_file
719 13th Street East|Glencoe|MN|55336
626 Valley Road|Montclair|NJ|07043
666 EAST DYER ROAD|SANTA ANA|CA|92705
20800 N. 135th Ave|Sun City West|AZ|85375
9775 Herring Gull Drive|Indianapolis|IN|46280
712 21st Street|Vero Beach|FL|32960
PO BOX 324|PORT SALERNO|FL|34992
207 Middleton Road|Lafayette|LA|70503
5091 nw fiddle leaf ct|port saint lucie|FL|34986
347 Mayberry Lane|Dover|DE|19904
2648 SW 137th Ave|Miramar|FL|33027
4410 Williams Dr SUITE 104|Georgetown|TX|78628
17020 Windsor Court|Homer Glen|IL|60491
11 Technology Drive North|Warren|NJ|07059
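Alternatively, replace every ", " with | and then turn the last remaining space on the line into a |:
sed -e 's/, /|/g' -e 's/ \([^ ]\+\)$/|\1/' file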
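For comparison, the capture-group approach from the first answer can be sketched in Python with the same regular expression. The function name `to_pipe_delimited` is an illustrative assumption, and like the sed version it relies on a two-character state code followed by a comma and a 5-digit ZIP.

```python
import re

# Same groups as the sed expression: street, city, state, ZIP.
ADDRESS = re.compile(r"^(.*), *(.*) (..), ([0-9]{5})$")


def to_pipe_delimited(line: str) -> str:
    """Return 'street|city|state|zip'; lines that do not match are returned unchanged."""
    return ADDRESS.sub(r"\1|\2|\3|\4", line)


sample = [
    "719 13th Street East, Glencoe MN, 55336",
    "626 Valley Road, Montclair NJ, 07043",
    "666 EAST DYER ROAD, SANTA ANA CA, 92705",
]
converted = [to_pipe_delimited(line) for line in sample]

for line in converted:
    print(line)
```

The greedy second group is what keeps multi-word cities such as SANTA ANA together: it takes everything up to the last space before the state code.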