Match 'State' text if written out or 2 digit? - regex

New user, not much experience with regex.
I cannot match the state in the following text:
Lead status
53 W 70th St
New York, New York 10023
I have the following regex:
Lead status\s+(\w+.)\v+(\w+[^,]+).(\d{5})
I'd like to end up with
Street Address,
City,
State (whether spelled out or a 2-letter abbreviation),
Zip
I am only getting
Street Address,
City,
Zip
I appreciate any help.
Cheers

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known in advance. The only criterion I have found so far is the occurrence of a country name belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example to just those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after each of them. Unfortunately there are cases where the to-be-extracted street information is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET, which are separated from the country by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it were possible, using regular expressions, to replace each country name by the same name preceded by an additional space (or even to do this only where there is just a single space in front of a country name), my problem would be solved. But I want to avoid running such a substitution over the whole database in a separate pass, because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to apply two processing steps to the same input in sequence. But maybe someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) lazily match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and the end of the line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
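As a minimal sketch of how this pattern behaves, here it is in Python's re module (an assumption: the question only says PCRE, and Python's flavour is close but not identical). The sample lines come from the question; the three spaces in the second line are added here only to show that extra whitespace is excluded from the capture.

```python
import re

# A few lines from the question. The country is preceded by one or more spaces.
text = """12H MARKET ST. Canada
123 SW 4TH Street   USA
# USA
1234 E MAIN STREET USA"""

# re.MULTILINE makes ^ and $ work per line; finditer plays the role of 'global'.
# The inner group is non-capturing here so that group(1) is the only capture.
pattern = re.compile(r"^(.*?)(?=\s+(?:USA|Canada)\s*$)", re.MULTILINE)

addresses = [m.group(1) for m in pattern.finditer(text)]
print(addresses)
```

Because the capture is lazy, it stops at the first position where the lookahead succeeds, so neither the separating spaces nor the country end up in the result.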

Regex pattern in salesforce apex

I am new to regex.
I have a String formatted like below
Street Name
City, StateCode ZipNumber
for example, the string can be like
50 Connecticut Avenue
Norwalk, CT 06850
or
123 6th Avenue
New York, NY 10013
or
4TH Highway 6
Rule, TX 79547
I am trying to construct a regex here.
But I cannot proceed, as I have little idea about regex.
Can you please help me?
The following might be enough:
^(?<Street>[^\n]+)\n(?<City>[^,]+), (?<StateCode>[A-Z]{2}) (?<Zip>\d+)$
It captures the following segments in different groups:
the first line in a group named Street
the part of the second line which precedes the comma in a group named City
the next two capital letters in a group named StateCode
the following digits in a group named Zip
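The same four groups can be tried out quickly outside Salesforce. The sketch below uses Python rather than Apex, so it is an illustration of the pattern and not Apex code; note that Python spells named groups (?P<name>...) instead of (?<name>...). The helper name parse_address is made up for this example.

```python
import re

# Same pattern as above, with Python's (?P<name>...) syntax for named groups.
ADDRESS_RE = re.compile(
    r"^(?P<Street>[^\n]+)\n(?P<City>[^,]+), (?P<StateCode>[A-Z]{2}) (?P<Zip>\d+)$"
)

def parse_address(text):
    """Return the four named parts as a dict, or None if the text does not match."""
    m = ADDRESS_RE.match(text)
    return m.groupdict() if m else None

parsed = parse_address("50 Connecticut Avenue\nNorwalk, CT 06850")
print(parsed)
```

The zip code is kept as a string, which preserves the leading zero in 06850.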

SAS Regex code to capture Business Address from 10-K company filings

Consider the following EDGAR 10-K SEC Company Filing
https://www.sec.gov/Archives/edgar/data/912382/000136231009004179/0001362310-09-004179.txt
BUSINESS ADDRESS:
STREET 1: 107 N PENNSYLVANIA ST
STREET 2: STE 600
CITY: INDIANAPOLIS
STATE: IN
ZIP: 46204
BUSINESS PHONE: 3172619000
MAIL ADDRESS:
STREET 1: 107 N PENNSYLVANIA ST
STREET 2: STE 600
CITY: INDIANAPOLIS
STATE: IN
ZIP: 46204
I need a regex in SAS to capture the fields STREET 1, STREET 2, CITY, STATE and ZIP under the Business Address, but not the Mailing Address. For example for STREET 1, I use STREET\s2\s*(.*) in SAS, but it ends up capturing the STREET 1 for Mailing address. Thanks!
This regex should work.
BUSINESS ADDRESS:\s*STREET\s1:\s*(.*)\s*STREET\s2:\s*(.*)
You can continue the pattern to capture each section you need in a new pair of parentheses. Basically you're just making sure that you get the first value after BUSINESS ADDRESS. The problem with the pattern you were using is that it can match in two separate locations, once under BUSINESS ADDRESS and once under MAIL ADDRESS, so you can end up with the mail-address value. Therefore you have to put something in the pattern that specifies which one you want.
In SAS you can use the prxposn function, with the second argument indicating the capture buffer (parenthesis) to retrieve. For example:
address1=prxposn(regex_pattern, 1, edgar10);
Best.
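To illustrate how the pattern can be continued through CITY, STATE and ZIP, here is a rough sketch in Python rather than SAS (so it does not use the prx functions). It uses named groups instead of numbered capture buffers, and the mail-address values in the sample are deliberately changed (hypothetical) so the two blocks can be told apart. It also assumes every field, including STREET 2, is present.

```python
import re

# Header excerpt; the MAIL ADDRESS values are made up so the blocks differ.
filing = """BUSINESS ADDRESS:
STREET 1: 107 N PENNSYLVANIA ST
STREET 2: STE 600
CITY: INDIANAPOLIS
STATE: IN
ZIP: 46204
BUSINESS PHONE: 3172619000
MAIL ADDRESS:
STREET 1: 1 EXAMPLE PLAZA
STREET 2: STE 100
CITY: CHICAGO
STATE: IL
ZIP: 60601
"""

# Anchoring on "BUSINESS ADDRESS:" pins the match to the first block.
# '.' does not cross newlines, so each (.*) captures the rest of one line.
pattern = re.compile(
    r"BUSINESS ADDRESS:\s*"
    r"STREET\s1:\s*(?P<street1>.*)\s*"
    r"STREET\s2:\s*(?P<street2>.*)\s*"
    r"CITY:\s*(?P<city>.*)\s*"
    r"STATE:\s*(?P<state>.*)\s*"
    r"ZIP:\s*(?P<zip>.*)"
)

business = pattern.search(filing).groupdict()
print(business)
```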

scala regex to limit with double space

I have a data like below
135 stjosephhrsecschool  london  DunAve
175865 stbele_higher_secondary sch  New York
11 st marys high school for women  Paris  Louis Avenue
I want to extract id schoolname city area.
The pattern is: an id (digits), followed by a single space, then the school name. The name can contain multiple words separated by single spaces, and may contain special characters. Then come two or more spaces, then the city, which can likewise contain multiple words separated by single spaces, or special characters. Then come two or more spaces, then the area, which follows the same rules as the school name and city. The area may or may not be present in the line; if it is not, I want a null value for the area.
Here is regex I have tried.
([\d]+) ([\w\s\S]+)\s\s+([\w\s\S]+)\s\s+([\w\s\S]*)
But this regex does not stop when it sees two or more spaces. I am not sure how to modify it to fit my data.
All help is appreciated.
Thanks
If I understand your issue correctly - the issue is that the resulting groups contain trailing spaces (e.g. "Louis Avenue "). If so - you can fix this by using the non-greedy modifiers like +? and *?:
([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*
Which results in what seems to be the desired output:
// fields are separated by two or more spaces
val s1 = "135 stjosephhrsecschool  london  DunAve"
val s2 = "175865 stbele_higher_secondary sch  New York  "
val s3 = "11 st marys high school for women  Paris  Louis Avenue "
val r = """([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*""".r
def matching(s: String) = s match {
  case r(a,b,c,d) => println((a,b,c,d))
  case _ => println("no match")
}
matching(s1) // (135,stjosephhrsecschool,london,DunAve)
matching(s2) // (175865,stbele_higher_secondary sch,New York,)
matching(s3) // (11,st marys high school for women,Paris,Louis Avenue)
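A different technique, not the regex above, is to peel off the id and then split the remainder on runs of two or more spaces. The sketch below is in Python rather than Scala, and the function name parse_school is made up for illustration. It returns None for the area when the line has no third field, which is what the question asks for.

```python
import re

def parse_school(line):
    """Split 'id name  city  [area]' into a 4-tuple; area is None when absent."""
    m = re.match(r"(\d+) (.*)", line.strip())
    if m is None:
        return None
    ident, rest = m.groups()
    # Fields are separated by two or more whitespace characters.
    fields = re.split(r"\s{2,}", rest)
    if len(fields) not in (2, 3):
        return None
    name, city = fields[0], fields[1]
    area = fields[2] if len(fields) == 3 else None
    return (ident, name, city, area)

print(parse_school("135 stjosephhrsecschool  london  DunAve"))
print(parse_school("175865 stbele_higher_secondary sch  New York"))
```

Unlike the single-regex version, this does not need trailing spaces after the city when the area is missing.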

How to use regex to extract an address with optional street2

I need to extract name, street1, street2, city, state, zip
I have data in this form
JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]
GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]
My results for JOHN should be
name="JOHN m SMITH"
street1="1111 WEST OAK ROAD"
street2="SUITE 101"
city = "CITY"
state = "ST"
zip = "55555"
This works with GEORGE's data
Regex r = new Regex(@"^(?<name>.*)\[(?<street>.*)[,]\s(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$");
var match = r.Match(fullNameAndAddress);
name = match.Groups["name"].Value;
street = match.Groups["street"].Value;
city = match.Groups["city"].Value;
state = match.Groups["state"].Value;
zip = match.Groups["zip"].Value;
How do I add the optional street2?
I want 1 and only 1 "street" group. I thought it should have this: (....){1}?
street2 is optional zero or 1 times. I thought it should have this (...)?
but it doesn't work with JOHN's data, both street1 & street2 are going into the street group:
^(?<name>.*)\[((?<street>.*)[,]\s){1}?((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$
Could you clarify what you want stored in street?
Do you want John's to look like '1111 WEST OAK ROAD, SUITE 101'?
Or do you want to stuff it into some variable you won't be using, so that street looks like '1111 WEST OAK ROAD'?
Edit: With clarification, check out this link
http://rubular.com/r/S4HaTMVFZl
What happens here, I believe, is that the * is greedy, grabbing as much as it can before finding the final occurrence of [,]\s
Adding a ? after the .* makes it lazy, grabbing as little as possible.
The amended regex looks like this
^(?<name>.*)\[((?<street>.*?)[,]\s)((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.{2})\s(?<zip>\d{5})\]$
You'll notice I changed the Regex for state from .* to .{2}, forcing a 2-character state. Feel free to revert that if you don't want it :)
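The greedy versus lazy difference is easy to see in isolation. Here is a small sketch in Python (the thread itself is C#, but .* and .*? behave the same way in both flavours), using the inside of JOHN's brackets as input:

```python
import re

s = "1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555"

# Greedy: .* runs to the end, then backs off to the LAST ", " in the string.
greedy = re.match(r"(.*), ", s).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST ", ".
lazy = re.match(r"(.*?), ", s).group(1)

print(greedy)
print(lazy)
```

The greedy capture swallows street2 and the city along with the street, which is the behaviour seen with JOHN's data.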
I made a couple of changes to your regex in rubular.com, and it seemed to be working on both the example strings:
^(?<name>.+)\s\[(?<street>[^,]+),\s((?<street2>[^,]+),\s+)?(?<city>[^,]+),\s(?<state>.+)\s(?<zip>\d{5})\]$
street2 = match.Groups["street2"].Value;
One trick I've learned with regexes is to use the negation of the divider (e.g. [^,]* for anything but a comma) instead of .*, so it's impossible to capture multiple fields with one expression. Also, the + operator, which requires at least one match, is useful in most of the groups.
Also, the additional comma is only there if there's a street2 component of the address, which indicates that the comma should be in the same group as the street2 part. I added an extra group around the street2 capture group to account for this. You can make groups non-capturing in most languages, but it didn't seem necessary.
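For anyone who wants to check the negated-class version against both sample strings, here is a sketch in Python rather than C#. Two things differ from the pattern above and are choices made for this example: named groups are written (?P<name>...), and the wrapper around street2 is made non-capturing with (?:...).

```python
import re

ADDRESS_RE = re.compile(
    r"^(?P<name>.+)\s\["
    r"(?P<street>[^,]+),\s"
    r"(?:(?P<street2>[^,]+),\s+)?"   # optional; the comma belongs to street2
    r"(?P<city>[^,]+),\s"
    r"(?P<state>.+)\s"
    r"(?P<zip>\d{5})\]$"
)

john = ADDRESS_RE.match(
    "JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]"
).groupdict()
george = ADDRESS_RE.match(
    "GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]"
).groupdict()

print(john)
print(george)
```

In Python an optional group that did not take part in the match comes back as None, whereas in .NET match.Groups["street2"].Value is an empty string.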