Regex pattern matching for inconsistent address patterns in large dat file - regex

I know it can't be perfect, but I'm not very good with regex and I'm having trouble getting a better match rate.
I have a file with over 9 million rows, and the addresses are very inconsistent. Any help would be greatly appreciated.
This is what I have so far. I thought the best approach would be to match the pattern from the end of the string, since APT, BX, PO BOX, etc. could be at the start of the string.
/(\d+\-\d+\s+|\d+-\D+|APT\s\D|APT\s\d+|APT\s\D\d+|APT\s\D\s\d+|SPACE\s\d+|POBOX\s\d+|BX|UNIT\s\d+|\d+-\d+|\d+)\s(.+)\s{2,}(\D+)\s(\D{2})$/
These are the patterns I can see. The long runs of spaces are as they appear in the file. I tried splitting on two or more spaces, as well as the regex I have so far.
F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS ZIP CITY STATE
ADDRESS CITY STATE
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY STATE
P O BOX # ADDRESS CITY STATE
APT DIGIT# ADDRESS CITY STATE
SPACE DIGIT ADDRESS CITY STATE
UNIT # ADDRESS CITY STATE
SP DIGIT ADDRESS CITY STATE
DIGITS-DIGITS ADDRESS CITY STATE
BX DIGIT ADDRESS CITY STATE
ADDRESS APT # CITY STATE
ADDRESS UNIT # CITY STATE
ADDRESS P O BOX DIGIT CITY STATE
P O B O X DIGIT CITY STATE
P O BOX DIGIT CITY STATE
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY STATE

This is a rather complex problem which sadly won't have a simple solution.
You could try the following regex, which is admittedly far from perfect:
^.*?(?<address>(?:\b(?:[a-zA-Z0-9.,:;\\\/#-]|\s(?=\S))*?(?<zip>\d{5}(?:-\d{4}|-\d{6})?)?\b)?)\s{2,}(?<city>\b(?:\w|\s(?=\S))+\b)\s{1,}(?<state>\b\w{2,3}\b)(?:$|\r|\n)
Groups: 1 = address; 2 = zip; 3 = city; 4 = state.
Input (note that I changed STATE to st, ZIP to 12345, and the PO box digits to actual digits):
F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS 12345 CITY st
ADDRESS CITY st
ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
P O BOX # 1234 ADDRESS CITY st
APT DIGIT# ADDRESS CITY st
SPACE DIGIT ADDRESS CITY st
UNIT # ADDRESS CITY st
SP DIGIT ADDRESS CITY st
DIGITS-DIGITS ADDRESS CITY st
BX DIGIT ADDRESS CITY st
ADDRESS APT # CITY st
ADDRESS UNIT # CITY st
ADDRESS P O BOX 3245 CITY st
P O B O X 123 CITY st
P O BOX 345 CITY st
ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY st
Matches
[0] => Array
(
[0] => F_NAME L_NAMEFOR F_NAME L_NAME ADDRESS 12345 CITY st
[1] => ADDRESS CITY st
[2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
[3] => APT # ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S CITY st
[4] => P O BOX # 1234 ADDRESS CITY st
[5] => APT DIGIT# ADDRESS CITY st
[6] => SPACE DIGIT ADDRESS CITY st
[7] => UNIT # ADDRESS CITY st
[8] => SP DIGIT ADDRESS CITY st
[9] => DIGITS-DIGITS ADDRESS CITY st
[10] => BX DIGIT ADDRESS CITY st
[11] => ADDRESS APT # CITY st
[12] => ADDRESS UNIT # CITY st
[13] => ADDRESS P O BOX DIGIT CITY st
[14] => P O B O X 123 CITY st
[15] => P O BOX 345 CITY st
[16] => ADDRESS SPACE/SP/SPC/UNIT DIGIT CITY st
)
[address] => Array
(
[0] => ADDRESS 12345
[1] => ADDRESS
[2] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
[3] => ADDRESS EAST/WEST/NORTH/SOUTH/E/W/N/S
[4] => ADDRESS
[5] => APT DIGIT#
[6] => ADDRESS
[7] => ADDRESS
[8] => ADDRESS
[9] => DIGITS-DIGITS ADDRESS
[10] => ADDRESS
[11] => APT #
[12] => UNIT #
[13] => DIGIT
[14] => 123
[15] => P O BOX 345
[16] => SPACE/SP/SPC/UNIT DIGIT
)
[zip] => Array
(
[0] => 12345
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] =>
[14] =>
[15] =>
[16] =>
)
[city] => Array
(
[0] => CITY
[1] => CITY
[2] => CITY
[3] => CITY
[4] => CITY
[5] => ADDRESS CITY
[6] => CITY
[7] => CITY
[8] => CITY
[9] => CITY
[10] => CITY
[11] => CITY
[12] => CITY
[13] => CITY
[14] => CITY
[15] => CITY
[16] => CITY
)
[state] => Array
(
[0] => st
[1] => st
[2] => st
[3] => st
[4] => st
[5] => st
[6] => st
[7] => st
[8] => st
[9] => st
[10] => st
[11] => st
[12] => st
[13] => st
[14] => st
[15] => st
[16] => st
)
Recommend having a look at question 11160192
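The idea from the question of working from the end of the line and splitting on runs of two or more spaces can be sketched in Python as follows. This is only a rough illustration, not a replacement for the regex above: the function name split_address and the sample lines are made up, and it assumes that the city and state always follow the last run of two or more spaces and that the state is a two-letter code.

```python
import re

def split_address(line):
    """Split a line into (street, city, state), or return None if it does not fit.

    Assumption: the last run of two or more spaces separates the street part
    from a trailing "CITY ST" part.
    """
    parts = re.split(r"\s{2,}", line.strip())
    if len(parts) < 2:
        return None  # no run of 2+ spaces, so there is nothing to anchor on
    tail = parts[-1].rsplit(None, 1)  # peel the last word off as the state
    if len(tail) != 2 or not re.fullmatch(r"[A-Za-z]{2}", tail[1]):
        return None
    city, state = tail
    street = " ".join(parts[:-1])  # any earlier pieces (APT, BOX, ...) stay with the street
    return street, city, state

print(split_address("APT 4   123 MAIN ST    SAN JOSE CA"))
print(split_address("123 MAIN ST SPRINGFIELD IL"))
```

Lines that return None can be collected and inspected separately, which makes it easier to see which of the patterns listed in the question still need their own handling.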

Denomales' answer is quite sufficient for your needs I think, but I'm going to expand my comment above into an answer since I think there are some relevant pieces specific to your question.
Are they US addresses? You could try an API or tool to extract the addresses en masse. Another recent Stack Overflow answer showed an example of such a tool applied to a small list of addresses.
For disclosure, I work at SmartyStreets and helped to develop this. While it's not designed specifically with spreadsheet or tabular address data in mind, it was designed for non-uniform input like freeform text. You can even splice millions of rows into the service in pieces.
Perhaps this will be helpful as it validates the addresses too, after it finds them in text. Addresses are real gnarly, as you're discovering, and a dedicated tool can sometimes be the best way to handle them. Not saying this is the correct answer for your case, but hopefully still informative.

Find all words that have "<-" at the end of the word OR in front of a dot

How do I pull out all words that have the symbol "<-" either at the end of the word or somewhere in the middle, but in the latter case only if the "<-" is followed by a dot?
To put it into context: Exercise 6.5.3 a. of Hadley Wickham's Advanced R asks the reader to list all replacement functions in the base package.
Replacement functions that have only one method are indicated by the symbol <- right at the end of the function name. Generic functions, however, have the method name attached to the name of the replacement form (with a dot), so that the <- is no longer at the end of the function name. Example: split<-.data.frame
EDIT:
objs <- mget(ls("package:base"), inherits = TRUE)
funs <- Filter(is.function, objs)
This is how you pull out all functions in the base package. Now I want to find only the replacement functions.
If you want all base package replacement functions and their respective S3 methods, you can try
ls(envir = as.environment("package:base"), pattern = "<-")
With no packages loaded, this gives the following result:
[1] "<<-" "<-" "[<-"
[4] "[[<-" "@<-" "$<-"
[7] "attr<-" "attributes<-" "body<-"
[10] "class<-" "colnames<-" "comment<-"
[13] "[<-.data.frame" "[[<-.data.frame" "$<-.data.frame"
[16] "[<-.Date" "diag<-" "dim<-"
[19] "dimnames<-" "dimnames<-.data.frame" "Encoding<-"
[22] "environment<-" "[<-.factor" "[[<-.factor"
[25] "formals<-" "is.na<-" "is.na<-.default"
[28] "is.na<-.factor" "is.na<-.numeric_version" "length<-"
[31] "length<-.factor" "levels<-" "levels<-.factor"
[34] "mode<-" "mostattributes<-" "names<-"
[37] "names<-.POSIXlt" "[<-.numeric_version" "[[<-.numeric_version"
[40] "oldClass<-" "parent.env<-" "[<-.POSIXct"
[43] "[<-.POSIXlt" "regmatches<-" "row.names<-"
[46] "rownames<-" "row.names<-.data.frame" "row.names<-.default"
[49] "split<-" "split<-.data.frame" "split<-.default"
[52] "storage.mode<-" "substr<-" "substring<-"
[55] "units<-" "units<-.difftime"
Thanks to @42 for helping me improve this answer.
We can try
library(stringr)
str_extract(v1, "\\w+<-$|\\w*<-\\.\\S+")
#[1] "split<-.data.frame" NA "splitdata<-"
data
v1 <- c("split<-.data.frame", "split<-data", "splitdata<-")
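For readers who want to try the same pattern outside R, here is a minimal Python sketch. The regular expression is the one from the str_extract() call above; the helper name extract_replacement is made up for illustration.

```python
import re

# Same pattern as in the str_extract() call, written as a Python raw string:
# either "<-" at the very end of the name, or "<-" followed by a dot and a method name.
PATTERN = re.compile(r"\w+<-$|\w*<-\.\S+")

def extract_replacement(name):
    """Return the matching part of name, or None when the pattern does not match."""
    m = PATTERN.search(name)
    return m.group(0) if m else None

v1 = ["split<-.data.frame", "split<-data", "splitdata<-"]
print([extract_replacement(s) for s in v1])
```

As in the R result, the second element has no match because its "<-" is neither at the end of the name nor followed by a dot.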

Passing URL through Command Line(C++)

I have C++ code that parses two command-line arguments and prints them. One of the arguments is the URL of a Google search. The code is below:
#include <iostream>

int main(int argc, char* argv[])
{
    // Print both arguments back to back, followed by a newline
    std::cout << argv[1] << argv[2] << "\n";
}
When I pass the URL on the command line after compiling, as below,
./demo 1 https://www.google.co.in/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&client=ubuntu&q=size%20of%20unsigned%20char%20array%20c%2B%2B&oq=length%20of%20unsigned%20char*%20arra&aqs=chrome.4.69i57j0l5.13353j0j7
I get the output as,
[1] 8680
[2] 8681
[3] 8682
[4] 8683
[5] 8684
[6] 8685
[7] 8686
[2] Done ion=1
[3] Done espv=2
[4] Done ie=UTF-8
[6]- Done q=size%20of%20unsigned%20char%20array%20c%2B%2B
It looks like there has been some internal splitting of the string. Is there any way I can retrieve the entire string?
Thank You in advance.
You have to quote it. Otherwise & gets interpreted by the shell as "invoke what's on the left of & in background".
I took the liberty of replacing your program with echo.
Good:
$ echo "https://www.google.co.in/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&client=ubuntu&q=size%20of%20unsigned%20char%20array%20c%2B%2B&oq=length%20of%20unsigned%20char*%20arra&aqs=chrome.4.69i57j0l5.13353j0j7"
https://www.google.co.in/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&client=ubuntu&q=size%20of%20unsigned%20char%20array%20c%2B%2B&oq=length%20of%20unsigned%20char*%20arra&aqs=chrome.4.69i57j0l5.13353j0j7
Bad:
$ echo https://www.google.co.in/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&client=ubuntu&q=size%20of%20unsigned%20char%20array%20c%2B%2B&oq=length%20of%20unsigned%20char*%20arra&aqs=chrome.4.69i57j0l5.13353j0j7
[1] 21705
[2] 21706
https://www.google.co.in/search?sourceid=chrome-psyapi2
[3] 21707
[4] 21708
[5] 21709
[6] 21710
[7] 21711
[1] Done echo https://www.google.co.in/search?sourceid=chrome-psyapi2
[2] Done ion=1
[3] Done espv=2
[4] Done ie=UTF-8
[5] Done client=ubuntu
[6]- Done q=size%20of%20unsigned%20char%20array%20c%2B%2B
[7]+ Done oq=length%20of%20unsigned%20char*%20arra
You need to quote the argument, and you should use single quotes, ', in order to stop your shell from attempting to evaluate anything inside it.
What happens is that every ampersand, "&", on your command line launches a background process.
The first process is ./demo 1 https://www.google.co.in/search?sourceid=chrome-psyapi2, and all the following are assignments to variables.
You can see from the output (it looks like you didn't post all of it)
[1] 8680
[2] 8681
[3] 8682
[4] 8683
[5] 8684
[6] 8685
[7] 8686
[2] Done ion=1
[3] Done espv=2
[4] Done ie=UTF-8
[6]- Done q=size%20of%20unsigned%20char%20array%20c%2B%2B
that background process 2 is ion=1 (pid 8681), process 3 (pid 8682) is espv=2, and so on.
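If the command line is built by a script rather than typed by hand, the same quoting can be applied programmatically. The following is a small Python sketch using the standard shlex module; the helper name build_command and the shortened example URL are assumptions made for illustration.

```python
import shlex

def build_command(program, *args):
    """Join a program and its arguments into one shell-safe command string."""
    return " ".join(shlex.quote(part) for part in [program, *args])

# Shortened, made-up URL containing the characters that caused the trouble (? and &)
url = "https://www.example.com/search?ion=1&espv=2&q=a%20b"
command = build_command("./demo", "1", url)
print(command)

# Splitting the string with POSIX shell rules yields the URL as a single argument again
print(shlex.split(command))
```

shlex.quote wraps the URL in single quotes, which matches the advice above: inside single quotes the shell does not treat & as a background operator.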

How to use separate() properly?

I'm having difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically, it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate, which lets you specify the patterns you want to capture. I'm assuming here that the ID always starts with a digit, so I'm capturing everything from the first digit up to the next _, and then everything after it.
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want the name as well, a simple solution would be the following (it assumes that the name always comes after the second underscore, and it joins the remaining pieces with spaces rather than underscores):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
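The same "split at the first two underscores only" idea can be sketched in Python, where str.split takes a maximum number of splits. The function name split_id_and_name is made up; the sketch assumes every value contains at least two underscores and raises a ValueError otherwise.

```python
def split_id_and_name(value):
    """Return (id, name) from a string shaped like PREFIX_ID_NAME.

    Splitting at most twice keeps any further underscores inside the name.
    """
    prefix, identifier, name = value.split("_", 2)
    return identifier, name

a = [
    "NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
    "NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
]
for value in a:
    print(split_id_and_name(value))
```

Unlike the Reduce(paste, ...) approach, this keeps the underscores in the name part unchanged.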

Extract e-mail address from string using r

These are 5 Twitter user descriptions. The idea is to extract the e-mail address from each string.
This is the code I've tried. It works, but there is probably something better.
I'd rather avoid using unlist() and do it in one go using regex. I've seen similar questions for Python/Perl/PHP, but not for R.
I know I could use grep(..., perl = TRUE), but that shouldn't be the only way to do it.
If it works, of course, it helps.
ds <- c("#MillonMusical | #PromotorMusical | #Diseñador | Contacto : ezequielife@gmail.com | #Instagram : Ezeqielgram | 01-11-11 | #_MillonMusical #flowfestar",
"LipGLosSTudio by: SAndry RUbio Maquilladora PRofesional estudiande de diseño profesional de maquillaje artistico lipglosstudio@hotmail.com/",
"Medico General Barranquillero radicado con su familia en Buenos Aires para iniciar Especialidad Medico Quirurgica. email jaenpavi@hotmail.com",
"msn = rdt031169@hotmail.comskype = ronaldotorres-br",
"Aguante piscis / manuarias17@gmail.com buenos aires"
)
ds <- unlist(strsplit(ds, ' '))
ds <- ds[grep("mail.", ds)]
> print(ds)
[1] "\t\tezequielife@gmail.com" "lipglosstudio@hotmail.com/"
[3] "jaenpavi@hotmail.com" "rdt031169@hotmail.comskype"
[5] "/\t\tmanuarias17@gmail.com"
It would be nice to separate this one, "rdt031169@hotmail.comskype",
perhaps by requiring it to end in .com or .com.ar, which would make sense for what I'm working on.
Here's one alternative:
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds))
[1] "ezequielife@gmail.com" "lipglosstudio@hotmail.com" "jaenpavi@hotmail.com" "rdt031169@hotmail.com"
[5] "manuarias17@gmail.com"
Based on @Frank's comment, if you want to keep the country identifier after .com, as in your example .com.ar, then look at this:
> ds <- c(ds, "fulanito13@somemail.com.ar") # a new e-mail address
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?", ds))
[1] "ezequielife@gmail.com" "lipglosstudio@hotmail.com" "jaenpavi@hotmail.com" "rdt031169@hotmail.com"
[5] "manuarias17@gmail.com" "fulanito13@somemail.com.ar"
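For comparison, the same approach (alphanumeric local part, alphabetic domain, .com with an optional two-letter country suffix) might look like this in Python. The sample strings and the helper name first_email are invented for illustration, and the pattern is deliberately as narrow as the one in the answer, so it is not a general e-mail validator.

```python
import re

# Alphanumeric local part, "@", alphabetic domain, ".com", optional ".xx" country suffix
EMAIL = re.compile(r"[A-Za-z0-9]+@[A-Za-z]+\.com(?:\.[a-z]{2})?")

def first_email(text):
    """Return the first e-mail-like match in text, or None if there is none."""
    m = EMAIL.search(text)
    return m.group(0) if m else None

# Made-up sample descriptions
samples = [
    "Contacto : alice01@example.com | 01-11-11",
    "msn = bob42@example.comskype = bob-br",
    "carol7@example.com.ar buenos aires",
    "no address here",
]
print([first_email(s) for s in samples])
```

Because the match stops right after .com unless a dot and two lowercase letters follow, text glued onto the end of an address (as in the second sample) is left out.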

Difficult Regexp

I need a regexp which does the following:
Here's the name of an HTML input field:
lm[0][ti]
I need to find the basic name ("lm"). Only if the name contains brackets I need to find the string in the second brackets ("ti").
To get it in portions is easy with the following regexp:
([a-zA-Z\d_]+)\[?([0-9]*)\]?\[?([a-zA-Z\d_]+)\]?
It matches all the portions I need.
Array
(
[0] => lm[0][ti]
[1] => lm
[2] => 0
[3] => ti
)
But if the HTML input name was just "lm", using this regexp I cannot determine that item #4 in the array is a valid name. The array would look like this:
Array
(
[0] => lm
[1] => l
[2] =>
[3] => m
)
"m" is not valid for me, I'd like to get this array:
Array
(
[0] => lm
[1] =>
[2] =>
[3] =>
)
or this
Array
(
[0] => lm
)
You can test the regexp here:
http://regexp-tester.mediacix.de/exp/regex/
Thanks for support in finding the right regexp...
Try this:
(\w+)(?:\[(\d+)\])?(?:\[(\w+)\])?
Input:
lm[0][ti]
Output:
Input:
lm
Output:
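The suggested pattern can be checked with a short Python sketch. The regular expression is the one from the answer; the function name parse_field_name is made up, and fullmatch is used here on the assumption that the whole field name should fit the pattern.

```python
import re

# Base name, then an optional numeric index in brackets, then an optional key in brackets
NAME = re.compile(r"(\w+)(?:\[(\d+)\])?(?:\[(\w+)\])?")

def parse_field_name(name):
    """Return (base, index, key); optional parts that are absent come back as None."""
    m = NAME.fullmatch(name)
    return m.groups() if m else None

print(parse_field_name("lm[0][ti]"))
print(parse_field_name("lm"))
```

Because each bracketed part sits inside its own optional non-capturing group, a bare "lm" no longer gets split into "l" and "m": the first group takes the whole name and the other two stay empty.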