python: Splitting Main address into primary and secondary addresses

python: Splitting Main address into primary and secondary addresses - regex

I need help to create a python function to make Main street address (usually house number and street name) in Address field. Additional address information (Suite, Unit, Space, PO Box, other additional details) saved to Address2
Here are few examples of Address format which need to split.
780 Main Street, P.O. Box 4109 -> 780 Main Street / PO Box 4109
438 University Ave. P.O. Box 5 -> 438 University Ave. / PO Box 5
HIGHWAY 10 BOX 39 -> HIGHWAY 10 / PO Box 39
98 LATHROP ROAD - BOX 147 -> 98 LATHROP ROAD / PO Box 147
396 S MAIN/P.O. BOX 820 -> 396 S MAIN / PO Box 820
HWY 18 AND HWY 128 (BOX 1305) -> HWY 18 AND HWY 128 / PO Box 1305
808 Innisfil Beach Rd Box 2 -> 808 Innisfil Beach Rd / PO Box 2
100 St 101 Ave, P.o. Box 1620 -> 100 St 101 Ave / P.O. Box 1620
201 Del Rio (p.O. Box 309 -> 201 Del Rio / PO Box 309
BOX 487 2054 HWY 1 EAST -> 2054 HWY 1 EAST / PO Box 487
P O BOX 2820 41340 BIG BEAR BL -> 41340 BIG BEAR BL / PO Box 2820
2813 HWY 15 - P O BOX 1083 -> 2813 HWY 15 / PO Box 1083
P.o. Box 838 2540 Hwy 43 West -> 2540 Hwy 43 West / POBox 838
I have tried below code. But It can remove important information from address and leave PO Box data in address (not to move all PO Box data into address2).
input_array = [
'780 Main Street, P.O. Box 410',
'438 University Ave. P.O. Box 5 ',
'HIGHWAY 10 BOX 39',
'98 LATHROP ROAD - BOX 147',
'396 S MAIN/P.O. BOX 820 ',
'HWY 18 AND HWY 128 (BOX 1305)',
'808 Innisfil Beach Rd Box 2',
'100 St 101 Ave, P.o. Box 1620',
'201 Del Rio (p.O. Box 309 ',
'BOX 487 2054 HWY 1 EAST ',
'P O BOX 2820 41340 BIG BEAR BL',
'2813 HWY 15 - P O BOX 1083 ',
'P.o. Box 838 2540 Hwy 43 West'
]
import re
for inputs in input_array:
inputs = (inputs).lower()
for a in (inputs.split(' ')):
if 'box' in a:
box_index = (inputs.split(' ').index(a))
box_num = ((inputs.split(' ')[(inputs.split(' ').index(a)) + 1]))
if (((inputs.split(' ')[(inputs.split(' ').index(a)) + 1])).isdigit()):
if 'p' in ((inputs.split(' ')[(inputs.split(' ').index(a)) - 1])) or 'o' in ((inputs.split(' ')[(inputs.split(' ').index(a)) - 1])):
inputs = inputs.replace(((inputs.split(' ')[(inputs.split(' ').index(a)) - 1])), '')
else:
inputs = inputs.replace(((inputs.split(' ')[(inputs.split(' ').index(a)) + 1])), '')
inputs = inputs.replace(a, '')
inputs = inputs.replace('-', '')
inputs = inputs.replace('/', '')
inputs = inputs.replace(',', '')
print ('address => ',inputs,' address2 => ', 'PO Box ', box_num)
break
Need Improvement in above function to make it more compatible with desired result.

Interesting enough question. Here's regex which works for all of your examples, but I can't say for sure if it will work all the way for your project.
Read more regex documentation and play with regular expressions here.
Here's code:
import re
streets = [
'780 Main Street, P.O. Box 410',
'438 University Ave. P.O. Box 5 ',
'HIGHWAY 10 BOX 39',
'98 LATHROP ROAD - BOX 147',
'396 S MAIN/P.O. BOX 820 ',
'HWY 18 AND HWY 128 (BOX 1305)',
'808 Innisfil Beach Rd Box 2',
'100 St 101 Ave, P.o. Box 1620',
'201 Del Rio (p.O. Box 309 ',
'BOX 487 2054 HWY 1 EAST ',
'P O BOX 2820 41340 BIG BEAR BL',
'2813 HWY 15 - P O BOX 1083 ',
'P.o. Box 838 2540 Hwy 43 West'
]
regex = r'([^a-z0-9]*(p[\s.]?o)?[\s.]*?box (\d+)[^a-z0-9]*)'
for street in streets:
match = re.search(regex, street, flags=re.IGNORECASE)
po_box_chunk = match.group(0)
po_box_number = match.group(3)
cleaned_address = street.strip(po_box_chunk)
result = '{} / PO Box {}'.format(cleaned_address, po_box_number)
print(result)

Related

Select value in column that matches value in list (UPDATED FOR CLARITY)

If I have a column of street addresses and want to select only the address's directional, what syntax would I use to accomplish that in Excel Power Query?
For instance, how do I get "NE" from "357 Pyrite Dr NE" even if the address is incorrectly formatted as "357 NE Pyrite Dr" or "357 Pyrite NE Dr"? Likewise, how would I get "NW" from "506 Mark NW St"?
As far as I can figure out, I would hit add column > custom column and enter a syntax similar to the following...
= if List.ContainsAny([Address], {"NE", "NW", "SE", "SW"}) = TRUE then Text.Select([Address], {"NE", "NW", "SE", "SW"} else null
...except I know that's not the correct syntax since it always produces an error. The same thing happens when I replace "Text.Select" with "List.Select" in the above formula.
For greater clarification, I'm posting the query as it stands now, whittled down to one column from a table with 100 columns and 4000 rows:
let
Source = q_NMAACC,
#"Removed Other Columns" = Table.SelectColumns(Source,{"Address - Street 1", "Address - Street 2"}),
#"Merged Columns" = Table.CombineColumns(#"Removed Other Columns",{"Address - Street 1", "Address - Street 2"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"Street Address"),
#"Trimmed Text" = Table.TransformColumns(#"Merged Columns",{{"Street Address", Text.Trim, type text}}),
#"Filtered Rows" = Table.SelectRows(#"Trimmed Text", each [Street Address] <> null and [Street Address] <> "")
in
#"Filtered Rows"
Here are the first 25 rows to give you some data to work off.
Street Address
PO Box 3416 Nr57 #165a
1016 Copper NE Ave Apt C
217 Garcia St NE
232 17th St SE
560 60th St NW
2935 Madeira Dr NE
9677 Eagle Ranch Rd NW Apt 415
5320 Roanoke Ave NW
17 Hwy 304
HCR 79 Box 46
6524 Camino Rojo
3518 Vail Ave SE
6412 Torreon Dr NE
6136 Flor de Rio Ct NW
1712 36th Street SE
734 Columbia Street
716 Morning Meadows Dr NE
6601 Tennyson St NE Apt 10207
Alamo - Rio Salado PO Box 804
206 Aragon Rd
6901 Verano Ct NW
6709 Siesta Pl NE
10 Meadow Hills Loop
98 Avenida Jardin
6903 Prairie Rd NE Apt 216

Try
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
List={"NE","NW","SW","SE"},
LocateTable = Table.FromList(List, null, {"Locate"}),
Find = Table.AddColumn(Source, "Found", (x) => Text.Combine(Table.SelectRows(LocateTable, each Text.Contains(x[Address],[Locate], Comparer.OrdinalIgnoreCase))[Locate],", "))
in Find
You could also use another table to contain the search criteria
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
Find = Table.AddColumn(Source, "Found", (x) => Text.Combine(Table.SelectRows(LocateTable, each Text.Contains(x[Address],[Locate], Comparer.OrdinalIgnoreCase))[Locate],", "))
in Find
the , Comparer.OrdinalIgnoreCase part is ignoring case for comparison, which you can remove if you want to match case

PySpark using Regexp_extract and Col to Create Dataset

I need help creating a dataset that shows both the first name and last name of people who live in Texas and the area code of their phone numbers (phone1). This is the coding that I tried to use and this is the dataset that I was given.
from pyspark.sql.functions import regexp_extract, col
regexp_extract(col('first_name + last_name'), '.by\s+(\w+)', 1))
first_name last_name company_name address city county state zip phone1
Billy Thornton Qdoba 8142 Yougla Road Dallas Fort Worth TX 34218 689-956-0765
Joe Swanson Beachfront 9243 Trace Street Miami Dade FL 56432 890-780-9674
Kevin Knox MSG 7683 Brooklyn Ave New York New York NY 56987 850-342-1123
Bill Lamb AFT 6394 W Beast Dr Houston Galveston TX 32804 407-413-4842
Raylene Kampa Hermar Inc 2046 SW Nylin Rd Elkhart Elkhart IN 46514 574-499-1454

Now I see. Your phone number status is good to split, so use split.
df.show()
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
|first_name|last_name|company_name| address| city| county|state| zip| phone1|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
| Billy| Thornton| Qdoba| 8142 Yougla Road| Dallas|Fort Worth| TX|34218|689-956-0765|
| Joe| Swanson| Beachfront|9243 Trace Street| Miami| Dade| FL|56432|890-780-9674|
| Kevin| Knox| MSG|7683 Brooklyn Ave|New York| New York| NY|56987|850-342-1123|
| Bill| Lamb| AFT| 6394 W Beast Dr| Houston| Galveston| TX|32804|407-413-4842|
| Raylene| Kampa| Hermar Inc| 2046 SW Nylin Rd| Elkhart| Elkhart| IN|46514|574-499-1454|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
df.filter("state = 'TX'") \
.withColumn('area_code', split('phone1', "-")[0].alias('area_code')) \
.select('first_name', 'last_name', 'state', 'area_code') \
.show()
+----------+---------+-----+---------+
|first_name|last_name|state|area_code|
+----------+---------+-----+---------+
| Billy| Thornton| TX| 689|
| Bill| Lamb| TX| 407|
+----------+---------+-----+---------+

Remove regex pattern from string and store in csv

I am trying to clean up a CSV by using regex. I have accomplished the first part which extracts the regex pattern from the address table and writes it to the street_numb field. The part I need help with is removing that same pattern from the street field so I only end up with the following (i.e., Steinway St, 31 St, 82nd Rd, and 19th St) stored in the street field. Hence these values would be removed (-78, -45, -35, -54) from the street field.
b street_numb street address zipcode
1 246 FIFTH AVE 246 FIFTH AVE 11215
2 30 -78 -78 STEINWAY ST 30 -78 STEINWAY ST 11016
3 25 -45 -45 31ST ST 25 -45 31ST ST 11102
4 123 -35 -35 82ND RD 123 -35 82ND RD 11415
5 22 -54 -54 19TH ST 22 -54 19TH ST 11105
Sample Data (above)
import csv
import re
path = '/Users/darchcruise/Desktop/bldg_zip_codes.csv'
with open(path, 'rU') as infile, open(path+'out.csv', 'w') as outfile:
fieldnames = ['b', 'street_numb', 'street', 'address', 'zipcode']
readablefile = csv.DictReader(infile)
writablefile = csv.DictWriter(outfile, fieldnames=fieldnames)
for row in readablefile:
add = re.match(r'\d+\s*-\s*\d+', row['address'])
if add:
row['street_numb'] = add.group()
# row['street'] = remove re.string (add.group()) from street field
writablefile.writerow(row)
else:
writablefile.writerow(row)
What code in line 12 (# remove re.string from row['street']) could be used to resolve my issue (removing -78, -45, -35, -54 from the street field)?

You can use capturing group with findall like this
[x for x in re.findall("(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])][0][0]-->gives street number
[x for x in re.findall("(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])][0][2]-->gives address

Extracting capital words and extracting the last word in a string

I have a df that looks like this:
df <- data.frame(
x = c(
"800 Block of MAIN ST",
"100 Block of CHESTNUT AV",
"BAY ST / WELLINGTON ST",
"LARKIN ST / ELLIS ST",
"MAPLE ST / WELLINGTON ST",
"MEANDERING RD / MAIN ST"),
y = rnorm(6))
I want to extract the first street name and the last street type.
Desired Output:
x y x.1 x.2
1 800 Block of MAIN ST -0.6745405 MAIN ST
2 100 Block of CHESTNUT AV -1.1316017 CHESTNUT AV
3 BAY ST / WELLINGTON ST 1.2887577 BAY ST
4 LARKIN ST / ELLIS ST 1.4606264 LARKIN ST
5 MAPLE ST / WELLINGTON ST 0.6538595 MAPLE ST
6 MEANDERING RD / MAIN ST 0.8472322 MEANDERING ST

library(stringr)
df[,c("street", "type")] <- list(str_extract(df$x, "[A-Z]{3,}"), str_extract(df$x, "[A-Z]+$"))
# x y street type
# 1 800 Block of MAIN ST 0.7787495 MAIN ST
# 2 100 Block of CHESTNUT AV -0.7069777 CHESTNUT AV
# 3 BAY ST / WELLINGTON ST -0.2365061 BAY ST
# 4 LARKIN ST / ELLIS ST 0.1399500 LARKIN ST
# 5 MAPLE ST / WELLINGTON ST -0.3423978 MAPLE ST
# 6 MEANDERING RD / MAIN ST 0.6434969 MEANDERING ST

df <- within(df, st_name <- sub(".*?([A-Z]{3,}).*", "\\1", x, perl=TRUE))
df <- within(df, st_type <- sub(".+? ([A-Z]+)$", "\\1", x, perl=TRUE))
# x y st_name st_type
#1 800 Block of MAIN ST 1.92908789 MAIN ST
#2 100 Block of CHESTNUT AV 0.02487045 CHESTNUT AV
#3 BAY ST / WELLINGTON ST -2.33411242 BAY ST
#4 LARKIN ST / ELLIS ST -1.17946144 LARKIN ST
#5 MAPLE ST / WELLINGTON ST 0.12913797 MAPLE ST
#6 MEANDERING RD / MAIN ST -0.94150930 MEANDERING ST
Or if you aren't fond of using within:
df$st_name <- sub(".*?([A-Z]{3,}).*", "\\1", df$x, perl=TRUE)
df$st_type <- sub(".+? ([A-Z]+)$", "\\1", df$x, perl=TRUE)

Here's a similar solution using a single regex expression combined with the new tstrsplit function from the development version of data.table
library(data.table) # v1.9.5+
setDT(df)[, c("street", "type") :=
tstrsplit(sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", x), ",")]
df
# x y street type
# 1: 800 Block of MAIN ST -1.4391801 MAIN ST
# 2: 100 Block of CHESTNUT AV 1.4917789 CHESTNUT AV
# 3: BAY ST / WELLINGTON ST -0.0369405 BAY ST
# 4: LARKIN ST / ELLIS ST 0.7320230 LARKIN ST
# 5: MAPLE ST / WELLINGTON ST 0.7189120 MAPLE ST
# 6: MEANDERING RD / MAIN ST -0.9836794 MEANDERING ST
Basically, the idea here is to capture both groups within a single sub call, concatenate them with a comma (you can choose something else if you like) and then perform a transpose sting split (tstrsplit) in order to convert them into two separate columns while creating them by reference (using the := operator)

Remove Data from Address Line

I have the following address that I pulled from a database. I am trying to clear everything up until ST|AVE|BLVD. I am trying to get rid of 1ST or the random 1.
9999-1000 N CLARK ST 1 1
4567-5678 W BELMONT AVE
1200 N HAMLIN AVE 1ST 1
8220 W CERMAK RD 1ST
1240 W 69TH ST 1ST
7901 W ADDISON ST 1ST
So that it reads:
1. 9999-1000 N CLARK ST
2. 4567-5678 W BELMONT AVE
3. 1200 N HAMLIN AVE
4. 8220 W CERMAK RD
5. 1240 W 69TH ST
6. 7901 W ADDISON ST

You can try the following regex:
^(.*?)(\s*(?:ST|AVE|BLVD).*)$
Your data is in capturing group 1.
See example here.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

python: Splitting Main address into primary and secondary addresses - regex

Related

Select value in column that matches value in list (UPDATED FOR CLARITY)

PySpark using Regexp_extract and Col to Create Dataset

Remove regex pattern from string and store in csv

Extracting capital words and extracting the last word in a string

Remove Data from Address Line

Categories

Resources