Regex to match (extract) only words from address string - regex

I have an input address list like this:
St. Washington, 80
7-th mill B.O., 34
Pr. Lakeview, 17
Pr. Harrison, 15 k.1
St. Hillside Avenue, 26
How can I match only the words from these addresses and get a result like this:
Washington
mill
Lakeview
Harrison
Hillside Avenue
The pattern (\w+) doesn't help in my case.

It's difficult to know what a "perfect" solution here looks like, as such input might encounter all sorts of unexpected edge cases. However, here's my initial attempt which does at least correctly handle all five examples you have given:
(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )
Demo Link
Explanation:
(?<= ) is a look-behind for a space. I chose this rather than the more standard \b "word boundary" because, for example, you don't want the th in 7-th or the O in B.O. to be counted as a "word".
[a-zA-Z][a-zA-Z ]* is matching letters and spaces only, where the first matched character must be a letter. (You could also equivalently make the regex case-insensitive with the /i option, and just use a-z here.)
(?=,| ) is a look-ahead for a comma or space. Again I chose this rather than the more standard \b "word boundary" because, for example, you don't want the B in B.O. to be counted as a "word".
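As a quick check, here is a minimal Python sketch that runs the pattern above over the five sample addresses. Python is an assumption here, since the question does not name a regex engine, and the names WORDS, addresses and results are purely illustrative.

```python
import re

# Pattern from the answer: a run of letters and spaces that starts right
# after a space and ends right before a comma or a space.
WORDS = re.compile(r"(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )")

addresses = [
    "St. Washington, 80",
    "7-th mill B.O., 34",
    "Pr. Lakeview, 17",
    "Pr. Harrison, 15 k.1",
    "St. Hillside Avenue, 26",
]

# findall returns every non-overlapping match found in each address.
results = [WORDS.findall(address) for address in addresses]

for address, words in zip(addresses, results):
    print(address, "->", words)
```

Each address yields exactly one match here; inputs with other punctuation or leading words without a space in front may behave differently.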

Related

Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
Tried the following but it grabbed the first line of the address as well
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy even though it works on RegexStorm
Looking for an expression to match the phone and fax numbers from it. I would be using the expression in WebHarvy
https://www.webharvy.com/articles/regex.html
Thanks
Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, you captured only the (\(\d{3}) part into Group 1, while you need to capture the whole number.
I also suggest restricting the context: either match P as a whole word, or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
In the first pattern, \b matches a word boundary. In the second, (?m)^\s* matches the start of a line ((?m) makes ^ match at the start of each line) followed by 0+ whitespace characters. To match only horizontal whitespace, replace \s* with [\p{Zs}\t]*.
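To see what Group 1 holds, here is a small Python sketch of both variants against the sample text. WebHarvy uses its own regex engine, so treat this only as an illustration of the capturing group: Python's re module is an assumption, it does not support \p{Zs}, and the F variant for the fax line is an obvious adaptation rather than something from the answer.

```python
import re

sample = """5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions"""

# Variant 1: P as a whole word; the whole number is captured in group 1.
PHONE = re.compile(r"\bP\s*(\(\d{3}\)\s*\d+-\d+)")
# Variant 2: F as the first word on a line ((?m) makes ^ match at each line start).
FAX = re.compile(r"(?m)^\s*F\s*(\(\d{3}\)\s*\d+-\d+)")

phone = PHONE.search(sample).group(1)
fax = FAX.search(sample).group(1)

print("phone:", phone)
print("fax:", fax)
```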

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to **LONDON** one fine day.
JOHN had lunch in a **PUB**.
JOHN then moved to **CHICAGO**.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But **THIS1** should match the pattern.
Also the other **70** times that the pattern should match.
Observed output:
**JOHN** went to **LONDON** one fine day.
**JOHN** had lunch in a **PUB**.
**JOHN** then moved to **CHICAGO**.
**I** don't want **JOHN** to be highlighted.
John does not want this to match the pattern.
Neither this.
But **THIS1** should match the pattern.
Also the other **70** times that the pattern should match.
The regex partly works, but I don't want two constant strings, JOHN and I, to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
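The effect of the lookahead can be checked with a short Python sketch over the sample sentences from the question. Python is an assumption (the question does not say which engine does the highlighting), and the names PATTERN, text and matches are illustrative.

```python
import re

# Shortened pattern with the negative lookahead that rejects the whole
# words JOHN and I before the main alternatives are tried.
PATTERN = re.compile(r"\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b")

text = """JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match."""

# The pattern has no capturing groups, so findall returns the full matches.
matches = PATTERN.findall(text)
print(matches)
```

Because the lookahead ends in \b, only the exact words JOHN and I are rejected; a longer word that merely starts with them, such as JOHNS, would still match.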

Excel 2007 VBA RegEx Help Needed

I'm working on an Excel 2007 VBA project that my client wants done yesterday and I need to use RegEx to locate strings within some pretty challenging data. This is my first exposure to RegEx so I'm stuck doing something I think is simple (maybe not) and I'm clueless.
I've added the reference to the VBScript RegEx engine (5.5) and RegEx is working O.K. in Excel - I just don't know how to construct the pattern statement. I need to locate occurrences of the word "trust" in a range of cells on a worksheet. In some of my data this word has been abbreviated "Tr". I have constructed the following RegEx statement to locate the word "trust" and all words that start with a space and contain "tr".
"trust| tr"
Unfortunately, this matches any word that contains "tr", like "trail", "tree", and so on. What I want to match is " tr" - meaning it has a leading space, the "tr", and nothing else in the word. Can somebody tell me what I need to do to make this happen?
I'm also going to need RegEx patterns for street addresses, city, state, and zip plus last name and first name. If there's a resource someone can point me to for these expressions I'd appreciate the help. I'm sorry to ask the group this question without spending the proper amount of time educating myself, but this is a time-sensitive project for which I need your expertise.
Thanks In Advance -
PS - Here's a sample of the data that I'm working with. I have this type of data present in 5 columns over 4,000 rows.
Jones Family **Trust**
3420 E Ave of the Ftns
3420 E Avenue of the Fountain
320 E ARROWHEAD **TRAILHEAD**
501 S 29TH ST
PO BOX 13422
71343 W Paradise Dr
152035 S 29TH ST
124 Owl Grove Pl
Johnson **Tr**
1900 E Arrowhead **Trl**
1900 E ARROWHEAD **TRL**
This is a sample from a column that predominantly contains street addresses. Other columns contain client names without addresses. So not every cell contains data that starts with a number.
I would rewrite your expression so that it finds trust and tr only where they are not preceded or followed by other letters, using \b, the word boundary assertion. \b matches at a position that is aptly called a "word boundary".
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
For more information on word boundaries, see regular-expressions.info. I'm not affiliated with that site.
\b(?:trust|tr)\b
After viewing the above, if you're still set on requiring the tr to be preceded by a space, then use this: \b(?:trust|\str)\b
Examples
Live Demo
https://regex101.com/r/xM4fR9/1
Note: I am assuming you're using the case insensitive flag for this
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    trust                    'trust'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    tr                       'tr'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
Or
The \b(?:trust|tr)\b expression isn't the most efficient, but it is readable.
A functionally identical, but more efficient regular expression would be:
\btr(?:ust)?\b
Here we're still using the \b word boundary, but we've just made the ust part of the word trust optional with the (?: ... )? construct.
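To illustrate the behaviour, here is a small Python sketch that applies the pattern to a few of the sample cells with the case-insensitive flag. The question uses the VBScript engine from Excel VBA, so Python is only an assumption for demonstration, and the names TRUST, cells and hits are illustrative.

```python
import re

# \b on both sides means "tr" or "trust" must stand alone as a word.
TRUST = re.compile(r"\btr(?:ust)?\b", re.IGNORECASE)

cells = [
    "Jones Family Trust",
    "320 E ARROWHEAD TRAILHEAD",
    "Johnson Tr",
    "1900 E Arrowhead Trl",
    "PO BOX 13422",
]

# Keep only the cells in which the pattern is found somewhere.
hits = [cell for cell in cells if TRUST.search(cell)]
print(hits)
```

TRAILHEAD and Trl are skipped because a letter follows the "tr", so there is no word boundary at that point.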

Regex Find Name in Fixed Length Records

I have this flat file and I want to make sure I only display records where the first name is Andrew
00012 Andrew Carter
02349 John Smith
20089 Charlotte Andrew
Each line contains, in order, three fields: five-digit employee number, first name, and last name. Each field is delimited by a space.
I think you want to find all lines that match the following pattern:
\d+\sAndrew.+
Or, as @Sam Sullivan points out, you could also specify the number of digits:
\d{5}\sAndrew.+
If you have set your regex options to allow the dot to match newline characters, you could use [^\n]+ instead of the final .+, as @Sam Sullivan also points out. But as @Casimir et Hippolyte notes, by default the dot will not match newline characters.
([0-9]{5})\s(Andrew)\s([A-Za-z\s.,-]{1,})
Each pair of parentheses captures one of the three pieces of information:
00012, Andrew, Carter
This requires five digits, a space, the name Andrew (case sensitive), another space, and then whatever the last name is, including compound surnames and suffixes.
The third capture is looking for a capital letter a through z, or a lower case letter a-z, a space, a period, comma, or dash
So "Carter-Smith, Jr. M.D." is a valid last name.
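A minimal Python sketch of the three capture groups follows. Python is an assumption, since the question does not name a language, and the helper name parse is hypothetical.

```python
import re

# Three capturing groups: employee number, first name, last name.
RECORD = re.compile(r"([0-9]{5})\s(Andrew)\s([A-Za-z\s.,-]{1,})")

def parse(line):
    """Return (number, first, last) if the record's first name is Andrew, else None."""
    match = RECORD.search(line)
    return match.groups() if match else None

print(parse("00012 Andrew Carter"))
print(parse("00012 Andrew Carter-Smith, Jr. M.D."))
print(parse("20089 Charlotte Andrew"))
```

Note that \s inside the third character class also matches newlines, so the pattern is best applied one line at a time, as in this sketch.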
First off, this is a case where tools can be your friend.
Check out
expresso: "http://www.ultrapico.com/expresso.htm"
It's great for design and testing of regexes
Also there is
RegexCoach: http://www.weitz.de/regex-coach/
Which actually lets you step through regexes like normal code
On to your question though:
Begin line: ^
Five digits: \d\d\d\d\d
Space: \s
Name: Andrew
Space: \s
LastName: [A-Za-z]+
End: $
So:
^\d\d\d\d\d\sAndrew\s[A-Za-z]+$
Disclaimer - not tested, but pretty confident :)
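For what it's worth, an anchored pattern of this shape can be tried against the three sample records with a few lines of Python. This sketch assumes \s for the single space between fields (the question says the fields are space-delimited) and uses \d{5} as shorthand for five \d; the names ANDREW, records and matching are illustrative.

```python
import re

records = """00012 Andrew Carter
02349 John Smith
20089 Charlotte Andrew"""

# Anchored at both ends: five digits, a space, Andrew, a space, a last name.
ANDREW = re.compile(r"^\d{5}\sAndrew\s[A-Za-z]+$")

# Test each line separately so ^ and $ refer to that record only.
matching = [line for line in records.splitlines() if ANDREW.match(line)]
print(matching)
```

The third record is rejected because Andrew appears there as the last name, not directly after the employee number.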

How to extract internal words using regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.
You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
  p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"
I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).
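Combining the look-behind from the question with this lookahead might look like the following Python sketch. Python is an assumption, the alternation (\w|\s|\')+ from the question is written here as the equivalent character class [\w\s']+, and the helper name street_name is hypothetical.

```python
import re

# Look-behind: a digit and a space come before the street name.
# Lookahead: a space, one final word and an optional period end the line.
STREET = re.compile(r"(?<=\d\s)[\w\s']+(?=\s\w+\.?$)")

def street_name(address):
    """Return the words between the house number and the final word, or None."""
    match = STREET.search(address)
    return match.group(0) if match else None

for address in ["23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff"]:
    print(address, "->", street_name(address))
```

The greedy class first runs to the end of the last word and then backtracks until the lookahead can see exactly one remaining word.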
Why not just match what you want? If I have understood correctly, you need all the words after the numbers, excluding the last word. Words are separated by spaces, so just take everything between the numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+)  Note: there's a space at the end.
That will leave what you want in \1 for each address.
If you want the match itself to be exactly that text, you can use \K (not supported by every regex engine):
\d+(?:-\d+)? \K.+(?= )
Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = " ".join(address.split()[1:-1])
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract. split() returns a list of words, so the slice is joined back into a single string.)