Remove Data from Address Line - regex

I have the following addresses that I pulled from a database. I am trying to remove everything after the street type (ST|AVE|BLVD), i.e. get rid of the trailing 1ST or the stray 1.
9999-1000 N CLARK ST 1 1
4567-5678 W BELMONT AVE
1200 N HAMLIN AVE 1ST 1
8220 W CERMAK RD 1ST
1240 W 69TH ST 1ST
7901 W ADDISON ST 1ST
So that it reads:
1. 9999-1000 N CLARK ST
2. 4567-5678 W BELMONT AVE
3. 1200 N HAMLIN AVE
4. 8220 W CERMAK RD
5. 1240 W 69TH ST
6. 7901 W ADDISON ST

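You can try the following regex:
^(.*?\s(?:ST|AVE|BLVD|RD))\b.*$
Your data is in capturing group 1. The street type is kept inside the group, and RD is included because it appears in the sample data.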
See example here.
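As a rough illustration, here is a minimal Python sketch of the same idea. The function name trim_address and the STREET_TYPES tuple are hypothetical, and the list of street types is an assumption based on the sample rows (extend it for your real data). It keeps everything up to and including the first whole-word street type and drops the rest.

```python
import re

# Assumed list of street types, taken from the sample rows; extend as needed.
STREET_TYPES = ("ST", "AVE", "BLVD", "RD")

# Lazily match up to the first street type that stands as a whole word.
PATTERN = re.compile(r"^(.*?\s(?:%s))\b" % "|".join(STREET_TYPES))

def trim_address(address):
    """Return the address cut off right after the street type.

    If no known street type is found, the address is returned unchanged.
    """
    match = PATTERN.match(address)
    return match.group(1) if match else address

if __name__ == "__main__":
    for line in ["9999-1000 N CLARK ST 1 1", "1240 W 69TH ST 1ST"]:
        print(trim_address(line))
```

The word boundary after the street type is what stops "ST" from matching inside "1ST" or at the start of a street name such as "STATE".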

Related

REGEX - how to extract a specific number of rows from a text

I need to find out how to extract a specific number of rows from a text (the number of rows that I want to extract is variable).
In this case, I want to extract everything from 07/06/2021 up to SOLD FINAL ZI 1.
TEXT
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccc
07/06/2021 P2P 00.00
T d r 0000 R A cc R A
r : aadr
REF. ------------------
P l p 00.00
P XX/XX/XXXX 0000000000 :00000000000 P R R
A B OO 0000000000 v e: 00.00 n 0000000000
c t 0.00 n
REF. ------------------
P2P 00.00
T d r 0000 R A c R A
rr : Saracie
REF. ------------------
P2P 00.00
T d r 0000 A. B c R A rr : Sanity
REF. ------------------
P l p 00.00
P XX/XX/XXXX 0000000000 00000000000 P R R
D OO 0000000000 V T: 00.00 n 0000000000 c
T 0.00 n
REF. ------------------
XX/XX/XXXX RULAJ ZI 1 3
SOLD FINAL ZI 1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccc
In regex, I start with \n(\d{2}/\d{2}/\d{4}) in order to get the date 07/06/2021, but I don't know how to extract the rest.
Thank you in advance!
Hello and welcome to Stack Overflow.
Your question might not describe your actual problem. Do you REALLY want to "extract a specific number of rows"? This might be an XY problem.
I like the solution from MDR to extract everything up to SOLD FINAL:
^(\d{2}\/\d{2}\/\d{4})[\s\S]+SOLD FINAL.
I like this because I guess you know the word at the end and not the number of lines. But we can't tell.
Anyway, to answer the question as asked (your actual problem might look different than we expect), you can use this regex:
^(\d{2}\/\d{2}\/\d{4}).*$(\n^.*$){n}
^ --> match at the beginning of a line (this needs the multiline flag)
(\d{2}\/\d{2}\/\d{4}) --> your regex for the date
.*$ --> also take the rest of the line
(\n^.*$){n} --> take the next n lines
    \n --> the line break
    ^ --> again: the beginning of a new line
    .* --> as many characters as possible up to the end of the line (greedy, but . does not match a line break)
    $ --> the end of the line
    {n} --> the number of lines you want to extract (replace n ;) )
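Both approaches above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and it assumes the text uses plain \n line breaks. re.MULTILINE is what makes ^ and $ match at every line boundary.

```python
import re

def extract_lines_after_date(text, n):
    """Return the first line starting with a dd/mm/yyyy date plus the next n lines."""
    pattern = re.compile(
        r"^\d{2}/\d{2}/\d{4}.*$(?:\n^.*$){%d}" % n,
        re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(0) if match else None

def extract_until_marker(text, marker="SOLD FINAL"):
    """Return everything from the first date line through the line containing marker."""
    pattern = re.compile(
        r"^\d{2}/\d{2}/\d{4}[\s\S]+?" + re.escape(marker) + r".*$",
        re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(0) if match else None
```

The marker-based version uses a lazy [\s\S]+? so that it stops at the first SOLD FINAL line rather than the last one in the text.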

R - How to document the number of grepl matches based on another data frame?

This is a rather tricky question indeed. It would be awesome if someone might be able to help me out.
What I'm trying to do is the following. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows). Let's call it NewHampshire.df:
   Municipality County       Population
1  Acworth      Sullivan     891
2  Albany       Carroll      735
3  Alexandria   Grafton      1613
4  Allenstown   Merrimack    4322
5  Alstead      Cheshire     1937
6  Alton        Belknap      5250
7  Amherst      Hillsborough 11201
8  Andover      Merrimack    2371
9  Antrim       Hillsborough 2637
10 Ashland      Grafton      2076
I've further compiled a new variable called grep_term, which combines the values from Municipality and County into a new variable that functions as an or-statement, something like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
I want to do two things:
1: Grepl through every observation in the location.df dataset, and save a TRUE or FALSE into a new variable depending on whether the self-disclosed location is part of the list in the first dataset.
2: Save the number of matches for a particular line in the NewHampshire.df dataset to a new variable. I.e., if there are 4 matches for Acworth in the twitter location dataset, there should be a value "4" for observation 1 in the NewHampshire.df on the newly created "matches" variable
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
  location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
                      sep = "|"),
                collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
                                 function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2
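For comparison, the same counting logic can be sketched outside R. The Python snippet below is only an illustration and is not part of the answer above: count_matches is a hypothetical helper, it uses case-insensitive matching as in the question's grepl call, and it skips missing locations (None stands in for NA).

```python
import re

def count_matches(grep_terms, locations):
    """For each "Municipality|County" term, count the locations it matches."""
    counts = []
    for term in grep_terms:
        pattern = re.compile(term, re.IGNORECASE)
        counts.append(
            sum(1 for loc in locations if loc is not None and pattern.search(loc))
        )
    return counts

if __name__ == "__main__":
    loc_vec = ["Acworth", "Hillsborough", "California", "Amherst",
               "Grafton", "Ashland", "London", None]
    terms = ["Acworth|Sullivan", "Albany|Carroll", "Amherst|Hillsborough"]
    print(count_matches(terms, loc_vec))
```

Note that plain substring matching will also count partial hits (for example "Alton" inside "Dalton"); wrapping each name in \b word boundaries is one way to tighten that.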

Remove regex pattern from string and store in csv

I am trying to clean up a CSV by using regex. I have accomplished the first part which extracts the regex pattern from the address table and writes it to the street_numb field. The part I need help with is removing that same pattern from the street field so I only end up with the following (i.e., Steinway St, 31 St, 82nd Rd, and 19th St) stored in the street field. Hence these values would be removed (-78, -45, -35, -54) from the street field.
b  street_numb  street           address             zipcode
1               246 FIFTH AVE    246 FIFTH AVE       11215
2  30 -78       -78 STEINWAY ST  30 -78 STEINWAY ST  11016
3  25 -45       -45 31ST ST      25 -45 31ST ST      11102
4  123 -35      -35 82ND RD      123 -35 82ND RD     11415
5  22 -54       -54 19TH ST      22 -54 19TH ST      11105
Sample Data (above)
import csv
import re
path = '/Users/darchcruise/Desktop/bldg_zip_codes.csv'
with open(path, 'rU') as infile, open(path+'out.csv', 'w') as outfile:
    fieldnames = ['b', 'street_numb', 'street', 'address', 'zipcode']
    readablefile = csv.DictReader(infile)
    writablefile = csv.DictWriter(outfile, fieldnames=fieldnames)
    for row in readablefile:
        add = re.match(r'\d+\s*-\s*\d+', row['address'])
        if add:
            row['street_numb'] = add.group()
            # row['street'] = remove re.string (add.group()) from street field
            writablefile.writerow(row)
        else:
            writablefile.writerow(row)
What code in line 12 (# remove re.string from row['street']) could be used to resolve my issue (removing -78, -45, -35, -54 from the street field)?
You can use capturing groups with findall, like this:
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][0]  # gives the street number (with trailing whitespace)
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][2]  # gives the street, i.e. the address without the number
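As a possible way to fill in the commented-out line of the question, here is a minimal sketch. The helper clean_row and the second pattern are assumptions made for illustration; the first pattern is the one from the question. It works on one row dictionary, so it can be called inside the existing loop before writerow.

```python
import re

# Same pattern as in the question: "30 -78", "25 -45", ...
STREET_NUMB_RE = re.compile(r"\d+\s*-\s*\d+")
# Assumed pattern for the leftover "-78 " at the start of the street field.
LEADING_SUFFIX_RE = re.compile(r"^\s*-\s*\d+\s*")

def clean_row(row):
    """Return a copy of row with street_numb filled in and street cleaned.

    Rows whose address has no hyphenated number are returned unchanged.
    """
    cleaned = dict(row)
    match = STREET_NUMB_RE.match(cleaned["address"])
    if match:
        cleaned["street_numb"] = match.group()
        cleaned["street"] = LEADING_SUFFIX_RE.sub("", cleaned["street"])
    return cleaned
```

Inside the loop this would be used as writablefile.writerow(clean_row(row)), which also removes the need for the separate else branch.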

Extracting capital words and extracting the last word in a string

I have a df that looks like this:
df <- data.frame(
  x = c(
    "800 Block of MAIN ST",
    "100 Block of CHESTNUT AV",
    "BAY ST / WELLINGTON ST",
    "LARKIN ST / ELLIS ST",
    "MAPLE ST / WELLINGTON ST",
    "MEANDERING RD / MAIN ST"),
  y = rnorm(6))
I want to extract the first street name and the last street type.
Desired Output:
  x                         y           x.1         x.2
1 800 Block of MAIN ST      -0.6745405  MAIN        ST
2 100 Block of CHESTNUT AV  -1.1316017  CHESTNUT    AV
3 BAY ST / WELLINGTON ST    1.2887577   BAY         ST
4 LARKIN ST / ELLIS ST      1.4606264   LARKIN      ST
5 MAPLE ST / WELLINGTON ST  0.6538595   MAPLE       ST
6 MEANDERING RD / MAIN ST   0.8472322   MEANDERING  ST
library(stringr)
df[,c("street", "type")] <- list(str_extract(df$x, "[A-Z]{3,}"), str_extract(df$x, "[A-Z]+$"))
# x y street type
# 1 800 Block of MAIN ST 0.7787495 MAIN ST
# 2 100 Block of CHESTNUT AV -0.7069777 CHESTNUT AV
# 3 BAY ST / WELLINGTON ST -0.2365061 BAY ST
# 4 LARKIN ST / ELLIS ST 0.1399500 LARKIN ST
# 5 MAPLE ST / WELLINGTON ST -0.3423978 MAPLE ST
# 6 MEANDERING RD / MAIN ST 0.6434969 MEANDERING ST
df <- within(df, st_name <- sub(".*?([A-Z]{3,}).*", "\\1", x, perl=TRUE))
df <- within(df, st_type <- sub(".+? ([A-Z]+)$", "\\1", x, perl=TRUE))
# x y st_name st_type
#1 800 Block of MAIN ST 1.92908789 MAIN ST
#2 100 Block of CHESTNUT AV 0.02487045 CHESTNUT AV
#3 BAY ST / WELLINGTON ST -2.33411242 BAY ST
#4 LARKIN ST / ELLIS ST -1.17946144 LARKIN ST
#5 MAPLE ST / WELLINGTON ST 0.12913797 MAPLE ST
#6 MEANDERING RD / MAIN ST -0.94150930 MEANDERING ST
Or if you aren't fond of using within:
df$st_name <- sub(".*?([A-Z]{3,}).*", "\\1", df$x, perl=TRUE)
df$st_type <- sub(".+? ([A-Z]+)$", "\\1", df$x, perl=TRUE)
Here's a similar solution using a single regular expression combined with the new tstrsplit function from the development version of data.table:
library(data.table) # v1.9.5+
setDT(df)[, c("street", "type") :=
            tstrsplit(sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", x), ",")]
df
# x y street type
# 1: 800 Block of MAIN ST -1.4391801 MAIN ST
# 2: 100 Block of CHESTNUT AV 1.4917789 CHESTNUT AV
# 3: BAY ST / WELLINGTON ST -0.0369405 BAY ST
# 4: LARKIN ST / ELLIS ST 0.7320230 LARKIN ST
# 5: MAPLE ST / WELLINGTON ST 0.7189120 MAPLE ST
# 6: MEANDERING RD / MAIN ST -0.9836794 MEANDERING ST
Basically, the idea here is to capture both groups within a single sub call, concatenate them with a comma (you can choose something else if you like), and then perform a transposed string split (tstrsplit) in order to convert them into two separate columns while creating them by reference (using the := operator).
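The two patterns used in the answers above are not specific to R. As a small illustration only, this Python sketch applies the same two searches to a single string; street_and_type is a hypothetical helper, and it assumes, like the R answers, that the street name is the first run of three or more capital letters and the type is the capitalized word at the very end.

```python
import re

def street_and_type(x):
    """Return (first street name, last street type) for one address string."""
    name = re.search(r"[A-Z]{3,}", x)   # first run of 3+ capital letters
    kind = re.search(r"[A-Z]+$", x)     # capital letters at the end of the string
    return (
        name.group(0) if name else None,
        kind.group(0) if kind else None,
    )

if __name__ == "__main__":
    print(street_and_type("800 Block of MAIN ST"))
    print(street_and_type("MEANDERING RD / MAIN ST"))
```

The three-letter minimum is what skips the single capital "B" in "Block"; it also means a two-letter street name would be missed.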

Repeating Capture Groups Regex

I have a large chunk of class data that I need to run a regular expression on and get data back from. The problem is that I need a repeating capturing group in order to accomplish that.
Womn St 157A QUEERHISTORY MAKING
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32680 LEC A 4 SHAH, P. TuTh 11:00-12:20p IAB 131 35 37 60 FULL
Womn St 171 SEX/RACE & CONQUEST
CCode Typ Sec Unt Instructor Time Place Max Enr Req Rstr Status
32710 LEC A 4 O'TOOLE, R. TuTh 2:00- 3:20p DBH 1300 52 13/45 24 OPEN
~ Same as 25610 (GlblClt 103B, Lec A); 26350 (History 169, Lec A); and
~ 60320 (Anthro 139, Lec B).
32711 DIS 1 0 MONSON, A. W 9:00- 9:50 HH 105 25 5/23 8 OPEN
O'TOOLE, R.
~ Same as 25612 (GlblClt 103B, Dis 1); 26351 (History 169, Dis 1); and
~ 60321 (Anthro 139, Dis 1).
The result I need would return two matches
Match
Group1:Womn St 157A
Group2:QUEERHISTORY MAKING
Group3:32680
Group4:LEC
Group5:A
Group6:SHAH, P.
Group7:TuTh 11:00-12:20p
Group8:IAB 131
Match
Group1:Womn St 171
Group2:SEX/RACE & CONQUEST
Group3:32710
Group4:LEC
Group5:A
Group6:O'TOOLE, R.
Group7:TuTh 2:00- 3:20p
Group8:DBH 1300
Group9:25610
Group10:26350
Group11:60320
Group12:32711
Group13:DIS
Group14:1
Group15:MONSON, A.
Group16: W 9:00- 9:50
Group17:HH 105
Group18:25612
Group19:26351
Group20:60321