Regex - find string by excluding part of it

Regex - find string by excluding part of it - regex

I have text: "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
Is is possible to find Lastname and phone?
Exclude address?
Address regex is this: "([A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}" (two words from capital, housenumber, comma, postal code and city)
I want result string to be "Walker +123456789012"

This should do what you need, and also doesn't assume three names (works without a middle name present), so it's a little more flexible in case you run into entries for people who don't have a middle name:
.*?(\w+)\s*(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}\s*(\+\d+)
.*?(\w+)\s* - Capture the last word before the whitespace before the address. .*? will lazily match anything up to the word preceeding the address, but not capture. \s* will match the whitespace between the word and the address.
(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,} - your address regex but using a non-capturing group (?:)
\s*(\+\d+) - Captures the + and following numbers. \s* will match the whitespace between the address and the +.
I reused your address regex, but made the capture group non-capturing. Then we match the last word before the address (the last name) using (\w+), and the + and following numbers after the address using (\+\d+).
Here it is in action: https://regex101.com/r/YGiaJT/1

You could do....
\w+\s+\w+\s+(\w+).*(\+\d+)
And your capture groups should match up pretty well with what you're trying to match...
Essentially this will "disregard" your first and second "words" (first / middle name) and then disregard EVERYTHING from in between until it finds a + then captures the digits after it.
Live example: https://regex101.com/r/MjJCSv/1
In theory if your last name and your address will always be separated by more than 1 space you can shorten this a little bit and write it as
(\w+)\s{2,}.*(\+\d+)
Live example of this functionality: https://regex101.com/r/vGGB4z/1
Example implementation of the later in java: http://ideone.com/RExAEO

You can use the following to capture just the surname and the phone number.
The first part ((\w+\s){3}) will capture the 3rd occurrence of a word followed by a space.
The second part (.+?) will capture everything
The third part ((\+?\d+)$) will capture an optional + (phone number prefix) and the rest of the phone number, up to the end of the string.
(\w+\s){3}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/4
But, IF the surname and the address is separated with more than 1 space, then you can use
(\w+)\s{2,}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/5
I've tested these expressions on the Java engine, and they give back the correct match

Related

Regex that matches two or three words, but does no catpure the third if it is a specific word

I need to match a specific pattern but I'm unable to do it with regular expressions. I'm looking for people's name. It follows always the same patterns. Some combinations are:
Mr. Snow
Mr. John Snow
Mr. John Snow (Winterfall of the nord lands)
My problem comes when sometimes I have things like: Mr. Snow and Ms. Stark. It captures also the and. So I'm looking for a regular expression that does not capture the second name only if it is and. Here I'm looking for ["Mr. Snow", "Ms. Stark"].
My best try is as follows:
(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?).
Note that the second name is in a non-capturing group. Because I was thinking to use a negative look-ahead, but If I do that, the first word is not captured (because the entire name does not match), and I need that to be captured.
Any Ideas?
Here is some text to fast check.

Here is my two cents:
\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b
See an online demo
\b - A word-boundary;
M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group to match an uppercase letter followed by lowercase letters and 0+ 2nd names concatenated through whitespace or hyphen;
\b - A word-boundary.

As it is a name of a person you could also check that the first letters of the words be uppercases.
M[rs].\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?
See the regex demo

Matching names is difficult, see this page for a nice article:
Falsehoods Programmers Believe About Names.
For the examples that you have given, you might use:
\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*
Explanation
\b A word boundary
M[rs]\. Match either Mr or Ms followed by a dot (note to escape it)
(?: Non capture group
Match a space (Or \s+ if you want want to allow newlines)
(?!M[rs]\.|and ) Negative lookahead, assert that from the current position there is not Mr or Ms or and directly to the right
\w+ Match 1+ word characters
)* Close the non capture group and optionally repeat it
Regex demo

This captures the first name in group 1 and the second in group 2if the second name exists and is not and:
(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?
See live demo.
If you want to capture the title as group 1 and the names as groups 2 and 3, change the look behind to a capture group:
(M[rs]\.) (\w+)(?: (?!and)(\w+))?

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with the Name: pre appended.
My regex expression to get the fullname is (?<=Name:\s)(.*) works fine to get the whole name. This (?<=Name:\s)([a-zA-Z]+) seems to get the first name.
So an expression each to get for first,middle & last name would be ideal. Could someone guide me in the right direction?
Thank you

You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search('(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you got the full name, you can apply split on the result, and get the names on a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.

As you stated in the comments that you use .NET you can make use of a quantifier in the lookbehind to select which part of a "word" you want to select after Name:
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo

Regex capturing the first occurrence of every group in a recurring pattern

Suppose I have the following text:
Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity
I have a regex (a bit more complex, but it boils down to this):
^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$
which has three capturing groups, that can capture the values of Name, Address and City (if they occur in the text). A few more examples are here: https://regex101.com/r/37nemH/6. EDIT The ordering is not fixed beforehand, and it could also happen that the fields are not separated by \t characters.
Now this all works well, the only slight problem I have is when one field occurs twice in the same text, as can be seen in the last example I put on regex101:
Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address
What I would want is for the second capturing group to match the first address, i.e. Street 123 ABC, and preferably to let the second occurrence be matched within the "City" group, i.e.
1: John Doe
2: Street 123 ABC
3: MyCity\tAddress: Other Address
Conceptually, I tried doing this with a negative lookbehind, e.g. replacing (?:Address: (.+?)) with (?:(?<!.*Address: )Address: (.+?)), i.e. assuring that an Address: match was not proceded somewhere in the text by another Address: tag. But, negative lookbehind does not allow for arbitrary length, so this obviously would not work.
Can this be achieved using regex, and how?

For your stated problem, you may use this regex with a conditional construct:
^.*?(?:(?:Name: (.+?)|(Address: )(.+?)|City: ((?(2).*?Address: )*.+?))\t*)+$
RegEx Demo
Your values are available in captured groups 1, 3, 4.
Capture group 2 is for literal label "Address: ".
Here, (?(2).*?Address: )* is a conditional construct that means if captured group 2 is present then in group 4 match text till next Address: is found (0 or more matches of this).
For the text Name: John Doe Address: Street 123 ABC City: MyCity Address: Second address, it will have following matches:
Group 1. 169-177 `John Doe`
Group 2. 178-187 `Address: `
Group 3. 187-201 `Street 123 ABC`
Group 4. 210-240 `MyCity Address: Second address`

If the word order can be any and some or all the items can be missing, it is much easier to use 3 separate patterns to extract the bits you need.
Name (demo):
^.*?Name:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
City (demo):
^.*?City:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
Address (demo):
^.*?Address:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
Details
^ - start of string
.*? - any 0+ chars other than line break chars, as few as possible
Address: - a keyword to stop at and look for the expected match
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible...
(?=\s*(?:Name:|Address:|City:|$)) - up to but excluding 0 or more whitespaces followed with Name:, Address:, City: or end of string.

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?

In WebHarvey, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed with a chunk of whitespaces followed with 1 or more word chars. Your regex contains a repeated capturing group whose contents are re-written upon each iteration and after it finds matching, Group 1 only contains n:
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2})
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.

How to extract internal words using regex

I am trying to match only the street name from a series of addresses. The addresses might look like:
23 Barrel Rd.
14 Old Mill Dr.
65-345 Howard's Bluff
I want to use a regex to match "Barrel", "Old Mill", and "Howard's". I need to figure out how to exclude the last word. So far I have a lookbehind to exclude the digits, and I can include the words and spaces and "'" by using this:
(?<=\d\s)(\w|\s|\')+
How can I exclude the final word (which may or may not end in a period)? I figure I should be using a lookahead, but I can't figure out how to formulate it.

You don't need a look-behind for this:
/^[-\d]+ ([\w ']+) \w+\.?$/
Match one or more digits and hyphens
space
match letters, digits, spaces, apostrophes into capture group 1
space
match a final word and an optional period
An example Ruby implementation:
regex = /^[-\d]+ ([\w ']+) \w+\.?$/
tests = [ "23 Barrel Rd.", "14 Old Mill Dr.", "65-345 Howard's Bluff" ]
tests.each do |test|
p test.match(regex)[1]
end
Output:
"Barrel"
"Old Mill"
"Howard's"

I believe the lookahead you want is (?=\s\w+\.?$).
\s: you don't want to include the last space
\w: at least one word-character (A-Z, a-z, 0-9, or '_')
\.?: optional period (for abbreviations such as "St.")
$: make sure this is the last word
If there's a possibility that there might be additional whitespace before the newline, just change this to (?=\s\w+\.?\s*$).

Why not just match what you want? If I have understood well you need to get all the words after the numbers excluding the last word. Words are separated by space so just get everything between numbers and the last space.
Example
\d+(?:-\d+)? ((?:.)+)  Note: there's a space at the end.
Tha will end up with what you want in \1 N times.
If you just want to match the exact text you may use \K (not supported by every regex engine) but: Example
With the regex \d+(?:-\d+)? \K.+(?= )

Another option is to use the split() function provided in most scripting languages. Here's the Python version of what you want:
stname = address.split()[1:-1]
(Here address is the original address line, and stname is the name of the street, i.e., what you're trying to extract.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex - find string by excluding part of it - regex

Related

Regex that matches two or three words, but does no catpure the third if it is a specific word

regex to split string into parts

Regex capturing the first occurrence of every group in a recurring pattern

Regex to extract city names (.NET)

How to extract internal words using regex

Categories

Resources