How do I match this pattern in R - regex

I have to match only the first Country name in the pattern below. The country names are given in all upper case letters. I used the following code to get the matches but it matches all the countries.
'\\b[A-Z]{2,}.\\b'
Eg: In the pattern below, I just want UNITED KINGDOM
x = "~ London, Greater London ~ UNITED KINGDOM;~ Ottawa, Ontario ~ CANADA;~,~ AUSTRALIA;~,~ POLAND;~,~ USA"

This seems to work:
regmatches(x, regexpr('\\b[A-Z ]{2,}\\b', x))
# [1] "UNITED KINGDOM"
I just added a space to make the character set [A-Z ]. Note that regexpr gets the first match while gregexpr gets all of them (similar to sub vs gsub).
For more info, I recommend the official docs at ?regexpr.
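For readers working outside R, the same first-match versus all-matches distinction can be sketched in Python. This is only an illustration, assuming Python's built-in re module: re.search plays the role of regexpr and re.findall the role of gregexpr. The variable names are made up for the example.

```python
import re

x = ("~ London, Greater London ~ UNITED KINGDOM;~ Ottawa, Ontario ~ CANADA;"
     "~,~ AUSTRALIA;~,~ POLAND;~,~ USA")

# Same character class as the R answer: upper-case letters and spaces.
pattern = re.compile(r"\b[A-Z ]{2,}\b")

# search() stops at the first match, like regexpr in R.
first = pattern.search(x)
first_country = first.group(0) if first else None

# findall() collects every match, like gregexpr in R.
all_countries = pattern.findall(x)

print(first_country)
print(all_countries)
```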


Convert regex to Rust regex. Replace text between nth comma

The goal is to use a regex to remove text between the nth and the next comma in rust.
For example outside of rust I would use
^((?:.*?,){4})[^,]*,(.*)$
on London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom
to get a desired result like:
London, City of Westminster, Greater London, England, United Kingdom
Unfortunately, I don't have a strong understanding of regex in general, so I would like to learn more about the mechanics and be able to use this in the program I'm writing to learn Rust.
Just copy pasting it ala
let string = "London, City of Westminster, Greater London, England, United Kingdom"
let re = Regex::new(r"^((?:.*?,){4})[^,]*,(.*)$").unwrap();
re.replace(string, "");
is not working obviously.
The value you want to remove is the fifth comma-delimited value, not the fourth, and you need to replace with two backreferences, $1 and $2, that refer to the Group 1 and Group 2 values.
Note that it is more precise to use a [^,] negated character class rather than a .*? lazy dot in the quantified part, since you are running it against a comma-delimited string.
See the Rust demo:
use regex::Regex;

fn main() {
    let string = "London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom";
    let re = Regex::new(r"^((?:[^,]*,){4})[^,]*,(.*)").unwrap();
    println!("{}", re.replace(string, "$1$2"));
    // => London, City of Westminster, Greater London, England, United Kingdom
}
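The two-group idea is not specific to Rust. As a rough sketch, assuming Python's re module, the same pattern can be wrapped in a helper that takes the number of leading fields to keep. The function name drop_field is hypothetical and used only for illustration.

```python
import re

def drop_field(text: str, n: int) -> str:
    """Remove the comma-delimited field that follows the first n fields."""
    # Group 1: the first n fields with their commas; group 2: the remainder.
    pattern = rf"^((?:[^,]*,){{{n}}})[^,]*,(.*)"
    # Put groups 1 and 2 back together, dropping the field between them.
    return re.sub(pattern, r"\g<1>\g<2>", text, count=1)

address = "London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom"
print(drop_field(address, 4))
```

If the input has too few fields, the pattern does not match and the text is returned unchanged.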

How do I get all words that begin with a capital letter following a specific string?

I have some text that could look something like this:
Name is William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain is a fake name.
I would like to run a regular expression against that string and pull out
William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain
as a match.
My current regex looks like this:
/\b((NAME\s\s*)(((\s*\,*\s*)? *)(([A-Z\'\-])([A-Za-z\'\-]+)*\s*){2,})?)\b/ig
and it does most of what I want but it's not perfect. Instead of just getting the name, it is also getting the "is a" following the name like this:
"William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain is a"
What is a regex formula to get only the words starting with a capital letter following the "Name" label and end when the next word starts with a lowercase after a space?
How do you like /Name ((?:[A-Z]\w+[ -]?)+)/?
Regex101: https://regex101.com/r/BFJBpZ/1
You can use:
Name\b[\sa-z]*\K(?:[A-Z][a-z]+[\s-]*)+(?=\s[a-z])
where
\K resets the starting point of the matching after having matched Name followed by some words in lower case
(?:[A-Z][a-z]+[\s-]*)+ will match all the words starting with a capital letter
(?=\s[a-z]) add the constraint that the following word starts with a lower case letter
demo: https://regex101.com/r/WBrdFU/1/
Notes:
You shouldn't use the i option in your regex. If you do, every [A-Z] character class will match lower-case letters as well as upper-case ones, which prevents you from selecting only the words that start with a capital letter.
Adding the names with apostrophe:
Name\b[\sa-z]*\K(?:[A-Z][a-z'\s-]*?)+(?=\s[a-z])
demo: https://regex101.com/r/WBrdFU/3/
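Python's built-in re module does not support \K, so a capture group can stand in for it. The following is a minimal sketch under that assumption, reusing the first pattern from this answer; NAME_RE and extract_name are names invented for the example.

```python
import re

# Capture group 1 stands in for the text that \K would have kept.
NAME_RE = re.compile(r"Name\b[\sa-z]*((?:[A-Z][a-z]+[\s-]*)+)(?=\s[a-z])")

def extract_name(text):
    """Return the run of capitalized words after the Name label, or None."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None

sample = ("Name is William Bob Francis Ford Coppola-Mr-Cool "
          "King-Of-The-Mountain is a fake name.")
print(extract_name(sample))
```

Note that no case-insensitive flag is passed, in line with the note above.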
My guess is that, this simple expression might work, if we always have is after our desired output:
Name is (.+?) is.+
Test
use strict;
my $str = 'Name is William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain is a fake name.
';
my $regex = qr/Name is (.+?) is.+/mp;
if ( $str =~ /$regex/g ) {
print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
# print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
# print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}
Advice
zdim advises that:
Perhaps, as it may not be "is", just any lower-case word (so after a word boundary), something like /\b([A-Z].+?)\b[a-z.!?]/ ... (probably needs tweaking, especially for the possible end of sentence after the name)?
This worked when I tested it on regex101.com. Please check and let me know if it works for you:
/Name is (([\s]*[A-Z][-a-z]*)*)/
Group 1 has this William Bob Francis Ford Coppola-Mr-Cool King-Of-The-Mountain
and test it on this link below
https://regex101.com/r/M2V2in/2

Extract different formats street address from a string using RE - Python

I have street address strings in different formats. I tried this old post, but it did not help much. My string formats are as follows:
format 1:
string_1 = ', landlord and tenant entered into a an agreement with respect to approximately 5,569 square feet of space in the building known as "the company" located at 788 e.7th street, st. louis, missouri 55605 ( capitalized terms used herein and not otherwise defined herein shall have the respective meanings given to them in the agreement); whereas, the term of the agreement expires on may 30, 2015;'
desired output:
788 e.7th street, st. louis, missouri 55605
format 2:
string_2 = 'first floor 824 6th avenue, chicago, il where the office is located'
desired output:
824 6th avenue, chicago, il
format 3:
string_3 = 'whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402.'
desired output:
90 south seventh street, suite 5400, dubuque, iowa, 55402
So far, I tried, this for string_1,
address_match_1 = re.findall(r'((\d*)\s+(\d{1,2})(th|nd|rd).*\s([a-z]))', string_1)
I get an empty list.
For the 2nd string I tried the same and getting the empty list as follows,
address_match_2 = re.findall(r'((\d*)\s+(\d{1,2})(th|nd|rd).*\s([a-z]))', string_2)
How can I match these using re? They are all in different formats. How can I get the suite included for string_3? Any help would be appreciated.
Solution
This regex matches all addresses in the question:
(?i)\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b)
You would need to add all of the states and their abbreviations, as well as a better match for the zip code, which you can find if you google it. Also, this will only work for US addresses.
Here is the output for each of the given strings:
>>> import re
>>> m = re.findall(r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))", string_1)
>>> print(m)
[('788 e.7th street, st. louis, missouri 55605', ' ', 'missouri', ' 55605')]
>>> m = re.findall(r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))", string_2)
>>> print(m)
[('824 6th avenue, chicago, il', ' ', 'il', '')]
>>> m = re.findall(r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))", string_3)
>>> print(m)
[('90 south seventh street, suite 5400, dubuque, iowa, 55402', ' ', 'iowa', ', 55402')]
The first value of each tuple has the correct address. However, this may not be exactly what you need (see Weakness below).
Detail
Assumptions:
Address starts with a number followed by a space
Address ends with a state, or its abbreviation, optionally followed by a 5 digit zip code
The rest of the address is in between the two parts above. This part doesn't contain any numbers surrounded by spaces (i.e. with no " \d+ ").
regex string:
r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))"
r"" makes the string a raw string, so backslashes don't need to be escaped
(?i) makes the regex case-insensitive (recent Python versions require this flag to be at the very start of the pattern)
\d+ address starts with a number followed by a space
(missouri|il|iowa)(, \d{5}| \d{5}|\b) address ends with a state, optionally followed by a zip code. The \b is just the 'end of word', which makes the zip code optional.
((?! \d+ ).)* any group of characters except for a number surrounded by spaces. Refer to this article for an explanation on how this works.
Weakness
Regular expressions are used to match patterns, but the addresses presented don't have much of a pattern compared with the rest of the string they may be in. Here is the pattern that I identified and that I based the solution on:
Address starts with a number followed by a space
Address ends with a state, or its abbreviation, optionally followed by a 5 digit zip code
The rest of the address is in between the two parts above. This part doesn't contain any numbers surrounded by spaces (i.e. with no " \d+ ").
Any address that violates these assumptions won't be matched correctly. For example:
Addresses starting with a number with letters, such as: 102A or 3B.
Addresses with numbers in between initial number and the state, such as one containing ' 7 street' instead of ' 7th street.'
Some of these weaknesses may be fixed with simple changes to the regex, but some may be more difficult to fix.
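The step of adding all of the states could be handled by building the alternation from a list rather than typing it by hand. The sketch below assumes Python 3 and uses only the three states from the question; STATES, build_address_regex and find_addresses are hypothetical names, and a real list would need every state and abbreviation.

```python
import re

# Illustrative subset only; a real list would need every state and abbreviation.
STATES = ["missouri", "il", "iowa"]

def build_address_regex(states):
    """Build the address pattern with the state names joined by |."""
    alternation = "|".join(re.escape(s) for s in states)
    return re.compile(
        rf"\d+ (?:(?! \d+ ).)*(?:{alternation})(?:, \d{{5}}| \d{{5}}|\b)",
        re.IGNORECASE,
    )

def find_addresses(text, states=STATES):
    # All groups are non-capturing, so findall returns whole matches.
    return build_address_regex(states).findall(text)

print(find_addresses("first floor 824 6th avenue, chicago, il where the office is located"))
print(find_addresses("whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402."))
```

Because the groups are non-capturing, each result is the address itself instead of a tuple. Short abbreviations also occur inside ordinary words, so a longer state list is likely to produce more false matches.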

Extract contents within brackets using R and Regex

I have a data-frame that contains user names in the format
"John Smith (Company Department)"
I want to extract the department from the username to add it to its own separate column.
I have tried the below code but it fails if the user name is something like
"John Smith (Company Department) John Doe)"
Can anyone help? Regex isn't my strong suit, and the code below only works if the user name is non-standard, like my example above with multiple brackets.
strcol <- "John Smith (FPO Sales) John Doe)"
start_loc <- str_locate_all(pattern ='\\(FPO ',strcol)[[1]][2]
end_loc <- str_locate_all(pattern ='\\)',strcol)[[1]][2]
substr(strcol, start_loc + 1, end_loc - 1)
Expected Output:
Sales
I have also tried the post here using non greedy, but got the following error:
Error: '[' is an unrecognized escape in character string starting ""/["
Note: the company will always be the same
You may use sub
> strcol <- "John Smith (FPO Sales) John Doe)"
> sub(".*\\(FPO[^)]*?(\\w+)\\).*", "\\1", strcol)
[1] "Sales"
.*\\(FPO matches all the characters up to and including (FPO.
[^)]*? matches any character except ), zero or more times (lazily).
(\\w+)\\) captures the one or more word characters that come last inside the same brackets.
.* matches all the remaining characters.
So replacing everything that was matched with the characters in group 1 gives the desired output.
OR
> library(stringr)
> str_extract(strcol, perl("FPO[^)]*?\\K\\w+(?=\\))"))
[1] "Sales"
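For comparison, the same capture-group approach can be sketched outside R. This is a minimal illustration assuming Python's re module; extract_department is a made-up helper name, and the company prefix is passed in because the question says it is always the same.

```python
import re

def extract_department(username, company="FPO"):
    """Return the last word inside the first (company ...) bracket, or None."""
    # Lazily skip ahead so that \w+ lands on the last word before ")".
    pattern = rf"\({re.escape(company)}[^)]*?(\w+)\)"
    match = re.search(pattern, username)
    return match.group(1) if match else None

print(extract_department("John Smith (FPO Sales) John Doe)"))
```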
gsub('.*\\s(.*)\\).*\\)$','\\1',strcol)
[1] "Sales"

Regular expression for address field validation

I am trying to write a regular expression that accepts an address, for example 21-big walk way or 21 St.Elizabeth's drive. I came up with the following regular expression, but I am not sure how to incorporate all the characters (alphanumeric, space, dash, full stop, apostrophe):
"regexp=^[A-Za-z-0-99999999'
See the answer to this question on address validating with regex:
regex street address match
The problem is, street addresses vary so much in formatting that it's hard to code against them. If you are trying to validate addresses, finding if one isn't valid based on its format is mighty hard to do.
This would return the following address (253 N. Cherry St. ), anything with its same format:
\d{1,5}\s\w\.\s(\b\w*\b\s){1,2}\w*\.
This allows 1-5 digits for the house number, a space, a character followed by a period (for N. or S.), 1-2 words for the street name, finished with an abbreviation (like st. or rd.).
Because regex is used to see if things meet a standard or protocol (which you define), you probably wouldn't want to allow for the addresses provided above, especially the first one with the dash, since they aren't very standard. You can modify my above code to allow for them if you wish. For example, you could add
(-?)
to allow for a dash but not require one.
In addition, http://rubular.com/ is a quick and interactive way to learn regex. Try it out with the addresses above.
If you don't have a fixed format for the address as mentioned above, I would use a regex just to eliminate the symbols that are not used in addresses (special symbols such as &(%#$^). The result would be:
[A-Za-z0-9'\.\-\s\,]
Just to add to Serzas' answer (since I don't have enough reputation to comment):
Letters and numbers can effectively be replaced by \w for word characters.
Additionally, the apostrophe, comma, period and hyphen don't necessarily need a backslash.
My requirement also involved forward and back slashes, so \/, and finally whitespace with \s. The working regex for me was:
pattern: "[\w',-\\/.\s]"
Regular expression for simple address validation
^[#.0-9a-zA-Z\s,-]+$
E.g. for Address match case
#1, North Street, Chennai - 11
E.g. for Address not match case
$1, North Street, Chennai # 11
I have successfully used:
Dim regexString = New stringbuilder
With regexString
.Append("(?<h>^[\d]+[ ])(?<s>.+$)|") 'find the 2013 1st ambonstreet
.Append("(?<s>^.*?)(?<h>[ ][\d]+[ ])(?<e>[\D]+$)|") 'find the 1-7-4 Dual Ampstreet 130 A
.Append("(?<s>^[\D]+[ ])(?<h>[\d]+)(?<e>.*?$)|") 'find the Terheydenlaan 320 B3
.Append("(?<s>^.*?)(?<h>\d*?$)") 'find the 245e oosterkade 9
End With
Dim Address As Match = Regex.Match(DataRow("customerAddressLine1"), regexString.ToString(), RegexOptions.Multiline)
If Not String.IsNullOrEmpty(Address.Groups("s").Value) Then StreetName = Address.Groups("s").Value
If Not String.IsNullOrEmpty(Address.Groups("h").Value) Then HouseNumber = Address.Groups("h").Value
If Not String.IsNullOrEmpty(Address.Groups("e").Value) Then Extension = Address.Groups("e").Value
The regex tries each alternative in turn: if one finds no result, it moves on to the next. If no result is found, none of the 4 formats were present.
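The try-each-format-in-turn idea can also be sketched in Python. Python's re module does not allow the same group name to be repeated within one pattern, so this sketch keeps the four alternatives as separate patterns and tries them in order. That is an adaptation for illustration and not the original VB.NET code; PATTERNS and split_address are hypothetical names.

```python
import re

# One pattern per address format, tried in order (h = house number,
# s = street name, e = extension), mirroring the four alternatives above.
PATTERNS = [
    re.compile(r"^(?P<h>\d+ )(?P<s>.+)$"),
    re.compile(r"^(?P<s>.*?)(?P<h> \d+ )(?P<e>\D+)$"),
    re.compile(r"^(?P<s>\D+ )(?P<h>\d+)(?P<e>.*?)$"),
    re.compile(r"^(?P<s>.*?)(?P<h>\d*?)$"),
]

def split_address(line):
    """Return a dict with keys s, h and e from the first matching pattern."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            parts = {"s": "", "h": "", "e": ""}
            # Keep only non-empty groups and trim the surrounding spaces.
            parts.update({k: v.strip() for k, v in match.groupdict().items() if v})
            return parts
    return None

print(split_address("2013 1st ambonstreet"))
print(split_address("Terheydenlaan 320 B3"))
```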
This one worked for me:
\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?
The source: https://www.codeproject.com/Tips/989012/Validate-and-Find-Addresses-with-RegEx
Regex is a very bad choice for this kind of task. Try to find a web service or an address database or a product which can clean address data instead.
Related:
Address validation using Google Maps API
As a simple one-line expression, I recommend this:
^([a-zA-Z0-9/\\''(),-\s]{2,255})$
I needed
STREET # | STREET | CITY | STATE | ZIP
So I wrote the following regex
[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}
This allows
1-5 Street #s
1-4 Street description words
1-3 City words
2 Char State
5 Char Zip code
I also added an optional comma (,?) for separating street, city, state, and zip.
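As a quick way to try the pattern above, here is a small sketch assuming Python's re module. It uses fullmatch so the whole string has to fit the STREET # | STREET | CITY | STATE | ZIP layout; US_ADDRESS and looks_like_us_address are names invented for this example.

```python
import re

US_ADDRESS = re.compile(
    r"[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}"
)

def looks_like_us_address(text):
    # fullmatch requires the entire string to fit the pattern.
    return US_ADDRESS.fullmatch(text) is not None

print(looks_like_us_address("123 Main St., Springfield, IL 62704"))
print(looks_like_us_address("Main St., Springfield, IL 62704"))
```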
Here is the approach I have taken to finding addresses using regular expressions:
A set of patterns is useful to find many forms that we might expect from an address starting with simply a number followed by set of strings (ex. 1 Basic Road) and then getting more specific such as looking for "P.O. Box", "c/o", "attn:", etc.
Below is a simple test in python. The test will find all the addresses but not the last 4 items which are company names. This example is not comprehensive, but can be altered to suit your needs and catch examples you find in your data.
import re
strings = [
'701 FIFTH AVE',
'2157 Henderson Highway',
'Attn: Patent Docketing',
'HOLLYWOOD, FL 33022-2480',
'1940 DUKE STREET',
'111 MONUMENT CIRCLE, SUITE 3700',
'c/o Armstrong Teasdale LLP',
'1 Almaden Boulevard',
'999 Peachtree Street NE',
'P.O. BOX 2903',
'2040 MAIN STREET',
'300 North Meridian Street',
'465 Columbus Avenue',
'1441 SEAMIST DR.',
'2000 PENNSYLVANIA AVENUE, N.W.',
'465 Columbus Avenue',
'28 STATE STREET',
'P.O, Drawer 800889.',
'2200 CLARENDON BLVD.',
'840 NORTH PLANKINTON AVENUE',
'1025 Connecticut Avenue, NW',
'340 Commercial Street',
'799 Ninth Street, NW',
'11318 Lazarro Ln',
'P.O, Box 65745',
'c/o Ballard Spahr LLP',
'8210 SOUTHPARK TERRACE',
'1130 Connecticut Ave., NW, Suite 420',
'465 Columbus Avenue',
"BANNER & WITCOFF , LTD",
"CHIP LAW GROUP",
"HAMMER & ASSOCIATES, P.C.",
"MH2 TECHNOLOGY LAW GROUP, LLP",
]
# Raw strings keep the backslashes in the patterns intact.
patterns = [
    r"c\/o [\w ]{2,}",
    r"C\/O [\w ]{2,}",
    r"P.O\. [\w ]{2,}",
    r"P.O\, [\w ]{2,}",
    r"[\w\.]{2,5} BOX [\d]{2,8}",
    r"^[#\d]{1,7} [\w ]{2,}",
    r"[A-Z]{2,2} [\d]{5,5}",
    r"Attn: [\w]{2,}",
    r"ATTN: [\w]{2,}",
    r"Attention: [\w]{2,}",
    r"ATTENTION: [\w]{2,}",
]
contact_list = []
total_count = len(strings)
found_count = 0
for string in strings:
    pat_no = 1  # 1-based index of the pattern being tried
    for pattern in patterns:
        match = re.search(pattern, string.strip())
        if match:
            print("Item found: " + match.group(0) + " | Pattern no: " + str(pat_no))
            found_count += 1
        pat_no += 1
print("-- Total: " + str(total_count) + " Found: " + str(found_count))
UiPath Academy training video lists this RegEx for US addresses (and it works fine for me):
\b\d{1,8}(-)?[a-z]?\W[a-z|\W|\.]{1,}\W(road|drive|avenue|boulevard|circle|street|lane|way|rd\.|st\.|dr\.|ave\.|blvd\.|cir\.|ln\.|rd|dr|ave|blvd|cir|ln)
I had a different use case - find any addresses in logs and scold application developers (favourite part of a devops job). I had the advantage of having the word "address" in the pattern but should work without that if you have specific field to scan
\baddress.[0-9\\\/# ,a-zA-Z]+[ ,]+[0-9\\\/#, a-zA-Z]{1,}
Look for the word "address" - skip this if not applicable
Look for first part numbers, letters, #, space - Unit Number / street number/suite number/door number
Separated by a space or comma
Look for one or more of rest of address numbers, letters, #, space
Tested against :
1 Sleepy Boulevard PO, Box 65745
Suite #100 /98,North St,Snoozepura
Ave., New Jersey,
Suite 420 1130 Connect Ave., NW,
Suite 420 19 / 21 Old Avenue,
Suite 12, Springfield, VIC 3001
Suite#100/98 North St Snoozepura
This worked for me when there were street addresses with unit/suite numbers, zip codes, only street. It also didn't match IP addresses or mac addresses. Worked with extra spaces.
This assumes users are normal people who separate elements of a street address with a comma, hash sign, or space, and not psychopaths who use characters like "|" or ":"!
I use this one for French addresses, and for some international addresses too.
[\\D+ || \\d]+\\d+[ ||,||[A-Za-z0-9.-]]+(?:[Rue|Avenue|Lane|... etcd|Ln|St]+[ ]?)+(?:[A-Za-z0-9.-](.*)]?)
I was inspired by the responses given here and came up with these 2 solutions:
support optional uppercase
support french also
regex structure
numbers (required)
letters, chars and spaces
at least one common address keyword (required)
as many chars you want before the line break
definitions:
accuracy
capacity of detecting addresses and not something that looks like an address which is not.
range
capacity to detect uncommon addresses.
Regex 1:
high accuracy
low range
/[0-9]+[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.)|(cir\.)|(blvd\.)|(hway\.)|(st\.)|(aut\.)|(ave\.)|(ln\.)|(rd\.)|(hw\.)|(dr\.)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
regex 2:
low accuracy
high range
/[0-9]*[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.?)|(cir\.?)|(blvd\.?)|(hway\.?)|(st\.?)|(aut\.?)|(ave\.?)|(ln\.?)|(rd\.?)|(hw\.?)|(dr\.?)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
This one works well for me
^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$
Source : https://community.alteryx.com/t5/Alteryx-Designer-Discussions/RegEx-Addresses-different-formats-and-headaches/td-p/360147
Here is my RegEx for address, city & postal validation rules
validation rules:
address -
1 - 40 characters length.
Letters, numbers, space and . , : ' #
city -
1 - 19 characters length
Only Alpha characters are allowed
Spaces are allowed
postalCode -
The USA zip must meet the following criteria and is required:
Minimum of 5 digits (9 digits if zip + 4 is provided)
Numeric only
A Canadian postal code is a six-character string.
in the format A1A 1A1, where A is a letter and 1 is a digit.
a space separates the third and fourth characters.
do not include the letters D, F, I, O, Q or U.
the first position does not make use of the letters W or Z.
address: ^[a-zA-Z0-9 .,#;:'-]{1,40}$
city: ^[a-zA-Z ]{1,19}$
usaPostal: ^([0-9]{5})(?:[-]?([0-9]{4}))?$
canadaPostal : ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
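To see how these four rules behave, they can be dropped into a small checker. This is a sketch assuming Python's re module; the RULES dictionary and the is_valid helper are hypothetical, and fullmatch is used so that a trailing newline is not silently accepted.

```python
import re

# The four patterns from this answer, keyed by field name.
RULES = {
    "address": re.compile(r"^[a-zA-Z0-9 .,#;:'-]{1,40}$"),
    "city": re.compile(r"^[a-zA-Z ]{1,19}$"),
    "usaPostal": re.compile(r"^([0-9]{5})(?:[-]?([0-9]{4}))?$"),
    "canadaPostal": re.compile(r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$"),
}

def is_valid(field, value):
    """Check one value against the rule for the given field name."""
    return RULES[field].fullmatch(value) is not None

print(is_valid("usaPostal", "12345-6789"))
print(is_valid("canadaPostal", "K1A 0B1"))
print(is_valid("city", "St. Louis"))
```

The last call is rejected because the city rule allows only letters and spaces.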
\b(\d{1,8}[a-z]?[0-9\/#- ,a-zA-Z]+[ ,]+[.0-9\/#, a-zA-Z]{1,})\n
A more dynamic approach to #micah would be the following:
(?'Address'(?'Street'[0-9][a-zA-Z\s]),?\s*(?'City'[A-Za-z\s]),?\s(?'Country'[A-Za-z])\s(?'Zipcode'[0-9]-?[0-9]))
It won't care about individual lengths of segments of code.
https://regex101.com/r/nuy7hB/1