Regular expressions get a single word out of a common phrase - regex

I have a phrase like this:
Computer, Eddie is gone to the market.
I want to get the word Eddie and ignore all of the other words since other words are constant, and the word Eddie could be anything.
How can I do this with a regular expression?
Edit:
Sorry I'm using .NET regex :)

You can use this pattern:
Computer, (\w+) is gone to the market\.
This uses parentheses to group \w+ and capture whatever it matches in group 1.
Note that the period at the end has been escaped with a \ because . is a regex metacharacter.
Given the input:
LOL! Computer, Eddie is gone to the market. Blah blah
blah. Computer, Alice is gone to the market... perhaps...
Computer, James Bond is gone to the market.
Then there are two matches (as seen on rubular.com). In the first match, group 1 captured Eddie. In the second match, group 1 captured Alice.
Note that \w+ doesn't match James Bond, because \w+ is a sequence of "one or more word characters". If you need to match these kinds of non-"single word" names, then simply replace \w+ with a regex that matches such names.
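One possible way to accept multi-word names is sketched below. It uses Python's re module as a stand-in for .NET (the group syntax involved is the same in both), and the (?: \w+)* extension is only an assumption about what counts as a name, not part of the original pattern.

```python
import re

# Sample input taken from the answer above.
text = """LOL! Computer, Eddie is gone to the market. Blah blah
blah. Computer, Alice is gone to the market... perhaps...
Computer, James Bond is gone to the market."""

# Original pattern: group 1 is a single run of word characters.
single_word = re.compile(r"Computer, (\w+) is gone to the market\.")

# Hypothetical variant: group 1 also accepts further space-separated words.
multi_word = re.compile(r"Computer, (\w+(?: \w+)*) is gone to the market\.")

single_names = single_word.findall(text)
multi_names = multi_word.findall(text)

print(single_names)  # ['Eddie', 'Alice']
print(multi_names)   # ['Eddie', 'Alice', 'James Bond']
```

The variant still relies on backtracking to stop the name before " is gone to the market.", so a name that itself contains that phrase would be split in the wrong place.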
References
regular-expressions.info/Capturing Groups and The Dot
General technique
Given this test string:
i have 35 dogs, 16 cats and 10 elephants
Then (\d+) (cats|dogs) yields 2 match results (see on rubular.com)
Result 1: 35 dogs
Group 1 captures 35
Group 2 captures dogs
Result 2: 16 cats
Group 1 captures 16
Group 2 captures cats
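The same result list can be reproduced with a short script. This is a minimal sketch in Python's re module rather than .NET; the variable names are illustrative.

```python
import re

text = "i have 35 dogs, 16 cats and 10 elephants"
pattern = re.compile(r"(\d+) (cats|dogs)")

# For every match, collect the whole match (group 0) and both captures.
results = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]

for whole, count, animal in results:
    print(f"{whole!r}: group 1 = {count}, group 2 = {animal}")
# '35 dogs': group 1 = 35, group 2 = dogs
# '16 cats': group 1 = 16, group 2 = cats
```

"10 elephants" produces no result because elephants is not one of the alternatives in group 2.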
Related questions
Saving substrings using Regular Expressions
C# snippet
Here's a simple example of capturing groups usage:
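using System;
using System.Text.RegularExpressions;

// Verbatim string literals (@"...") keep the backslashes in the pattern intact.
var text = @"
LOL! Computer, Eddie is gone to the market. Blah blah
blah. Computer, Alice is gone to the market... perhaps...
Computer, James Bond is gone to the market.
";
Regex r = new Regex(@"Computer, (\w+) is gone to the market\.");
foreach (Match m in r.Matches(text)) {
    // Groups[1] holds the text captured by (\w+).
    Console.WriteLine(m.Groups[1]);
}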
The above prints (as seen on ideone.com):
Eddie
Alice
API references
System.Text.RegularExpressions Namespace
On specification
As noted, \w+ does not match "James Bond". It does, however, match "o_o", "giggles2000", etc (as seen on rubular.com). As much as reasonably practical, you should try to make your patterns as specific as possible.
Similarly, (\d+) (cats|dogs) will match 100 cats in $100 catsup (as seen on rubular.com).
These are issues with the patterns themselves, and not directly related to capturing groups.
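One way to make the second pattern more specific is to add word boundaries. The sketch below uses Python's re module, and the \b-wrapped variant is a suggestion of mine rather than something taken from the answer.

```python
import re

loose = re.compile(r"(\d+) (cats|dogs)")
# Hypothetical tightened variant: the match must start and end on a word boundary.
strict = re.compile(r"\b(\d+) (cats|dogs)\b")

loose_hits = loose.findall("$100 catsup")
strict_hits = strict.findall("$100 catsup")
strict_ok = strict.findall("i have 35 dogs, 16 cats and 10 elephants")

print(loose_hits)   # [('100', 'cats')]
print(strict_hits)  # []
print(strict_ok)    # [('35', 'dogs'), ('16', 'cats')]
```

The trailing \b fails between the s and the u of catsup, so the false positive disappears while the intended matches are unaffected.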

/^Computer, \b(.+)\b is gone to the market\.$/
Eddie would be in the first captured string $1. If you specify the language, we can tell you how to extract it.
Edit: C#:
Match match = Regex.Match(input, @"^Computer, \b(.+)\b is gone to the market\.$");
Console.WriteLine(match.Groups[1].Value);
Remove ^ and $ from the regex if the phrase may appear inside a longer string: by default they anchor the match to the start and end of the input (or of each line when RegexOptions.Multiline is set).
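The effect of the anchors can be checked with a small experiment. This sketch uses Python's re module, where ^ and $ behave like the .NET defaults for single-line input such as this; the sample strings are illustrative.

```python
import re

anchored = re.compile(r"^Computer, \b(.+)\b is gone to the market\.$")
unanchored = re.compile(r"Computer, \b(.+)\b is gone to the market\.")

exact = "Computer, Eddie is gone to the market."
embedded = "LOL! Computer, Eddie is gone to the market. Blah blah"

# The anchored pattern matches only when the phrase is the whole input.
exact_name = anchored.search(exact).group(1)
embedded_anchored = anchored.search(embedded)

# Without the anchors, the phrase is found inside the longer string.
embedded_name = unanchored.search(embedded).group(1)

print(exact_name)         # Eddie
print(embedded_anchored)  # None
print(embedded_name)      # Eddie
```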

Related

Regex that matches two or three words, but does not capture the third if it is a specific word

I need to match a specific pattern but I'm unable to do it with regular expressions. I'm looking for people's names, which always follow the same patterns. Some combinations are:
Mr. Snow
Mr. John Snow
Mr. John Snow (Winterfall of the nord lands)
My problem comes when I have things like: Mr. Snow and Ms. Stark. The regex also captures the and. So I'm looking for a regular expression that does not capture the second word only if it is and. Here I'm looking for ["Mr. Snow", "Ms. Stark"].
My best try is as follows:
(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?).
Note that the second name is in a non-capturing group. I was thinking of using a negative look-ahead, but if I do that, the first word is not captured (because the entire name does not match), and I need it to be captured.
Any ideas?
Here is some text for a quick check.
Here is my two cents:
\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b
See an online demo
\b - A word-boundary;
M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group to match an uppercase letter followed by lowercase letters and 0+ 2nd names concatenated through whitespace or hyphen;
\b - A word-boundary.
As it is the name of a person, you could also check that the first letter of each word is uppercase.
M[rs].\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?
See the regex demo
Matching names is difficult, see this page for a nice article:
Falsehoods Programmers Believe About Names.
For the examples that you have given, you might use:
\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*
Explanation
\b A word boundary
M[rs]\. Match either Mr or Ms followed by a dot (note to escape it)
(?: Non capture group
Match a space (or \s+ if you want to allow newlines)
(?!M[rs]\.|and ) Negative lookahead, assert that from the current position there is not Mr or Ms or and directly to the right
\w+ Match 1+ word characters
)* Close the non capture group and optionally repeat it
Regex demo
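The pattern explained above can be tried against the combinations from the question. This is a minimal sketch in Python's re module, which supports the same lookahead syntax; the samples dictionary is illustrative.

```python
import re

# Title followed by any number of words, stopping before "and " or another title.
title_and_names = re.compile(r"\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*")

samples = [
    "Mr. Snow",
    "Mr. John Snow",
    "Mr. John Snow (Winterfall of the nord lands)",
    "Mr. Snow and Ms. Stark",
]

# The pattern has no capturing groups, so findall returns the whole matches.
found = {s: title_and_names.findall(s) for s in samples}

for sample, names in found.items():
    print(f"{sample!r} -> {names}")
```

The last sample yields ['Mr. Snow', 'Ms. Stark'], which is the output the question asks for; the parenthesised suffix in the third sample is left out because ( is not a word character.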
This captures the first name in group 1 and the second in group 2, if the second name exists and is not and:
(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?
See live demo.
If you want to capture the title as group 1 and the names as groups 2 and 3, change the look behind to a capture group:
(M[rs]\.) (\w+)(?: (?!and)(\w+))?
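The three-group version can be checked the same way. The sketch below uses Python's re module; note that findall reports a group that did not participate in the match as an empty string.

```python
import re

# Group 1: title, group 2: first word, group 3: optional second word (not "and...").
pattern = re.compile(r"(M[rs]\.) (\w+)(?: (?!and)(\w+))?")

pair = pattern.findall("Mr. Snow and Ms. Stark")
full = pattern.findall("Mr. John Snow (Winterfall of the nord lands)")

print(pair)  # [('Mr.', 'Snow', ''), ('Ms.', 'Stark', '')]
print(full)  # [('Mr.', 'John', 'Snow')]
```

As written, (?!and) also rejects any second word that merely begins with the letters and; using (?!and\b) instead would be a stricter alternative if that matters for the data.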

Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
Tried the following but it grabbed the first line of the address as well
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy even though it works on RegexStorm
Looking for an expression to match the phone and fax numbers from it. I would be using the expression in WebHarvy
https://www.webharvy.com/articles/regex.html
Thanks
Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, Group 1 captures only the \(\d{3} part, whereas you need to capture the whole number.
I also suggest to restrict the context: either match P as a whole word, or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
In the first pattern, \b matches a word boundary, so P is matched only as a whole word. In the second pattern, (?m) makes ^ match at the start of each line, and \s* then matches 0+ whitespaces. To match only horizontal whitespace, replace \s* with [\p{Zs}\t]*.
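The line-anchored pattern can be tried on the sample text as follows. This is a sketch in Python's re module, not in WebHarvy's .NET engine, and the extract_number helper is a hypothetical name introduced only so the same pattern can be reused for the F line.

```python
import re

sample = """5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions"""


def extract_number(label, text):
    """Return the number on the line that starts with `label`, or None."""
    # (?m) makes ^ match at the start of every line; group 1 is the whole number.
    pattern = r"(?m)^\s*" + re.escape(label) + r"\s*(\(\d{3}\)\s*\d+-\d+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None


phone = extract_number("P", sample)
fax = extract_number("F", sample)
missing = extract_number("X", sample)

print(phone)    # (123) 456-7890
print(fax)      # (123) 456-7890
print(missing)  # None
```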

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?
In WebHarvy, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed by a chunk of whitespace and 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once a match is found, Group 1 contains only the last character, n:
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
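Both behaviours can be observed in a short script: the single capturing group returns the city, while the repeated group from the question keeps only its last iteration. The sketch uses Python's re module (not .NET), and the pattern is written with balanced parentheses, i.e. without a closing parenthesis after \w{2}.

```python
import re

addresses = [
    "1234 Savoy Dr Ste 123\nNew Houston, TX 77036-3320",
    "1234 Savoy Dr Ste 510\nTexas, TX 77036-3320",
]

# One or two words on the same line, followed by a comma and a two-character state.
city = re.compile(r"(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}")
cities = [city.search(address).group(1) for address in addresses]

# The pattern from the question: group 1 is overwritten on every repetition.
repeated = re.compile(r"(\w|\w\s\w)+(?=,\s\w{2})")
last_iteration = repeated.search(addresses[0]).group(1)

print(cities)          # ['New Houston', 'Texas']
print(last_iteration)  # n
```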

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex works partly but I don't want two constant strings- JOHN and I to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
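The shortened pattern can be run against the lines from the question to confirm which words it picks up. This is a minimal sketch in Python's re module, whose lookahead syntax matches the one used here; the hits dictionary is illustrative.

```python
import re

pattern = re.compile(r"\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b")

lines = [
    "JOHN went to LONDON one fine day.",
    "JOHN had lunch in a PUB.",
    "JOHN then moved to CHICAGO.",
    "I don't want JOHN to be highlighted.",
    "John does not want this to match the pattern.",
    "But THIS1 should match the pattern.",
    "Also the other 70 times that the pattern should match.",
]

# Map each line to the list of words the pattern matches in it.
hits = {line: pattern.findall(line) for line in lines}

for line, words in hits.items():
    print(f"{words} <- {line}")
```

JOHN and I are skipped on every line, while LONDON, PUB, CHICAGO, THIS1 and 70 are still matched.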

Noob regex poser (match MAY contain and MUST have)

Probably really simple for you Regex masters :) I'm a noob at regex, just having picked up some PHP, but wanting to learn (once this project is complete, I'll knuckle down and crack regular expressions).
I'd like to understand how to compose a regex that may contain some data, but must contain other data.
My example being, the match MAY begin with numbers but doesn't have to, however if it does, I need the number and the following 2 words. If it doesn't begin with a number, just the first 2 words. The data will be at the beginning of the string.
The following would match:
123 Fore Street, Fiveways (123 Fore Street returned(no comma))
Our House Village (Our House returned)
7 Eightnine (7 Eightnine returned)
Thanks
Something like this should work:
^((?:\d+\s)?\w+(?:\s\w+)?)
You can test it out somewhere like http://rubular.com/ before coding it, it's usually easier.
What it means:
^ -> beginning of the line
(?:\d+\s)? -> a non-capturing group (marked by ?:) consisting of one or more digits and a whitespace character; since we follow it by ?, it's optional.
\w+(?:\s\w+)? -> one or more word characters (letters, digits or underscore; look up what \w means), followed by, optionally, a whitespace character and another "word", again in a non-capturing group.
The whole thing is encapsulated in a capturing group, so group 1 will contain your match.
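The three inputs from the question can be used to check the pattern. The sketch below uses Python's re module instead of PHP's preg functions; the pattern syntax involved is common to both, and the variable names are illustrative.

```python
import re

# Optional leading number, then one word, then an optional second word.
pattern = re.compile(r"^((?:\d+\s)?\w+(?:\s\w+)?)")

inputs = [
    "123 Fore Street, Fiveways",
    "Our House Village",
    "7 Eightnine",
]

# Group 1 holds the part of each string that the question wants returned.
extracted = [pattern.match(s).group(1) for s in inputs]

print(extracted)  # ['123 Fore Street', 'Our House', '7 Eightnine']
```

The comma after Street ends the match because it is neither whitespace nor a word character.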
Use this regex with the multiline option:
^(\d+(\s*\b[a-zA-Z]+\b){1,2}|(\s*\b[a-zA-Z]+\b){1,2})
Group 1 contains your required data.
\d+ matches a digit (\d) one or more times (+)
\s* matches whitespace (\s) zero or more times (*)
(\s*\b[a-zA-Z]+\b){1,2} matches 1 to 2 words