This is the PCRE2 regexp:
(?<=hello )(?:[^_]\w++)++
Its intended use is against strings like the following:
Hello Bob (Marius) Smith. -> Match "Bob"
Hello Bob Jr. (Joseph) White -> Match "Bob Jr."
Hello Bob Jr. IInd (Paul) Jobs -> Match "Bob Jr. IInd"
You get the point.
Essentially there is a magic word, in this case "hello", followed by a first name, followed by a second name which is always between parens.
First names could be anything really. A single word, a list of words followed by punctuation, and so on. Heck, look at Elon Musk's kid's name (X Æ A-Xii) to see how weird names can get :)
Let's assume ASCII only, though. Æ is not in my targets :)
I'm at a loss as to how to convert this regexp to JS. The only viable solution I found was to use PCRE2-wasm on Node, which spins up a WASM virtual machine and sucks up 1 GB of resources just for that. That's insane.
This would match your cases in ECMAscript.
(?<=[Hh]ello )(?:[^_][\w.]+)+
You need to match a capital H by using [Hh] instead of h, because your test cases start with a capital H. The ++ also has to become a single +, because ECMAScript does not support possessive quantifiers.
You also need to include a . alongside \w, since some of the names contain one.
https://regex101.com/r/lkZK7w/1
-- thanks "D M" for pointing out the missing . in the testcase.
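A quick way to check the ECMAScript version against the three sample strings from the question is sketched below. The variable names (nilsPattern, nilsSamples, nilsNames) are made up for this illustration.

```javascript
// ECMAScript-compatible rewrite: [Hh] for the capital H, a plain + instead of ++.
const nilsPattern = /(?<=[Hh]ello )(?:[^_][\w.]+)+/;

// Sample strings taken from the question.
const nilsSamples = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// match() returns null when nothing matches, so guard before reading index 0.
const nilsNames = nilsSamples.map((s) => {
  const m = s.match(nilsPattern);
  return m ? m[0] : null;
});

console.log(nilsNames); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd' ]
```

Lookbehind needs an ES2018-capable engine, so this assumes a reasonably recent Node or browser.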
@Nils has the correct answer.
If you do need to expand your acceptable character set, you can use the following regex. Check it out. The g, m, and i flags are set.
(?<=hello ).*(?=\([^\)]*?\))
Hello Bob (Marius) Smith.
Hello Bob Jr. (Joseph) White
Hello Bob Jr. IInd (Paul) Jobs
Hello X Æ A-Xii (Not Elon) Musk
Hello Bob ()) Jr. ( (Darrell) Black
Match Number   Characters   Matched Text
Match 1        6-10         Bob
Match 2        32-40        Bob Jr.
Match 3        61-74        Bob Jr. IInd
Match 4        92-102       X Æ A-Xii
Match 5        124-138      Bob ()) Jr. (
The idea is pretty simple:
Look behind for your keyword: (?<=hello ).
Look ahead for your middle name: (?=\([^\)]*?\)) (anything inside a set of parentheses that is not a closing parenthesis, lazily so you don't take part of the first name).
Take everything between as your first name: .*.
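Here is a rough sketch of those three pieces working together in JavaScript, using the g, m, and i flags mentioned above. The variable names are made up for this example. Note that .* also picks up the space in front of the opening parenthesis, so the sketch trims each match; that trimming is my addition, not part of the regex.

```javascript
// Lookbehind for the keyword, .* for the first name, lookahead for "(...)".
const lookaroundPattern = /(?<=hello ).*(?=\([^\)]*?\))/gim;

// \u00C6 is the letter Æ, written as an escape to keep this file ASCII-only.
const lookaroundText = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello X \u00C6 A-Xii (Not Elon) Musk",
  "Hello Bob ()) Jr. ( (Darrell) Black",
].join("\n");

// matchAll requires the g flag; each raw match ends with a space, so trim it.
const lookaroundNames = Array.from(
  lookaroundText.matchAll(lookaroundPattern),
  (m) => m[0].trim()
);

console.log(lookaroundNames);
```

Because . does not match a newline, each line of the input is handled on its own.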
The ++ does not work because JavaScript does not support possessive quantifiers.
Since the first name is followed by a second name that is always in parentheses, you might also use a capture group with a match instead of a lookbehind.
\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)
\b[Hh]ello Match hello or Hello followed by a space
( Capture group 1
\w+.*? Match 1+ word chars followed by any char, as few as possible
) Close group 1
\s*\([^()\s]+\) Match optional whitespace chars followed by ( till )
const regex = /\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)/;
["Hello Bob (Marius) Smith.",
"Hello Bob Jr. (Joseph) White",
"Hello Bob Jr. IInd (Paul) Jobs"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
}
})
With the lookbehind, you might also match word characters followed by an optionally repeated non-capturing group that matches whitespace chars followed by word characters or dots.
(?<=[Hh]ello )\w+(?:\s+[\w.]+)*
Regex demo
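To compare the two approaches from this answer side by side, something like the following should do. It runs the capture-group pattern and the lookbehind pattern over the sample strings from the question. The variable names are made up for this sketch.

```javascript
// Capture-group style: the first name ends up in group 1.
const captureStyle = /\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)/;
// Lookbehind style: the first name is the whole match.
const lookbehindStyle = /(?<=[Hh]ello )\w+(?:\s+[\w.]+)*/;

const helloSamples = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// Collect both results per sample; null means the pattern did not match.
const helloComparison = helloSamples.map((s) => {
  const byCapture = s.match(captureStyle);
  const byLookbehind = s.match(lookbehindStyle);
  return {
    capture: byCapture ? byCapture[1] : null,
    lookbehind: byLookbehind ? byLookbehind[0] : null,
  };
});

console.log(helloComparison);
```

For these three inputs both patterns give the same names. They can differ on other inputs, because the lookbehind version only accepts word characters and dots in the name.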
I have an input address list like this:
St. Washington, 80
7-th mill B.O., 34
Pr. Lakeview, 17
Pr. Harrison, 15 k.1
St. Hillside Avenue, 26
How can I match only the words from these addresses and get a result like this:
Washington
mill
Lakeview
Harrison
Hillside Avenue
The pattern (\w+) doesn't help in my case.
It's difficult to know what a "perfect" solution here looks like, as such input might encounter all sorts of unexpected edge cases. However, here's my initial attempt which does at least correctly handle all five examples you have given:
(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )
Explanation:
(?<= ) is a look-behind for a space. I chose this rather than the more standard \b "word boundary" because, for example, you don't want the th in 7-th or the O in B.O. to be counted as a "word".
[a-zA-Z][a-zA-Z ]* is matching letters and spaces only, where the first matched character must be a letter. (You could also equivalently make the regex case-insensitive with the /i option, and just use a-z here.)
(?=,| ) is a look-ahead for a comma or space. Again I chose this rather than the more standard \b "word boundary" because, for example, you don't want the B in B.O. to be counted as a "word".
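For anyone who wants to try this outside a regex tester, here is a small JavaScript sketch that runs the pattern over the five addresses from the question. The variable names are made up, and it assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Letters and spaces, preceded by a space and followed by a comma or space.
const addressPattern = /(?<= )[a-zA-Z][a-zA-Z ]*(?=,| )/g;

const addressLines = [
  "St. Washington, 80",
  "7-th mill B.O., 34",
  "Pr. Lakeview, 17",
  "Pr. Harrison, 15 k.1",
  "St. Hillside Avenue, 26",
];

// With the g flag, match() returns every match in the line, or null for none.
const addressWords = addressLines.flatMap(
  (line) => line.match(addressPattern) || []
);

console.log(addressWords);
// [ 'Washington', 'mill', 'Lakeview', 'Harrison', 'Hillside Avenue' ]
```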
I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with Name: prepended.
My regex to get the full name, (?<=Name:\s)(.*), works fine for the whole name. This one, (?<=Name:\s)([a-zA-Z]+), seems to get the first name.
Ideally I'd like one expression each for the first, middle, and last name. Could someone guide me in the right direction?
Thank you
You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> import re
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you got the full name, you can apply split on the result, and get the names on a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
Since you stated in the comments that you use .NET, you can use a quantifier in the lookbehind to select which part of a "word" you want after Name:
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
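The same idea can be sketched in JavaScript, since ES2018 lookbehind also accepts quantifiers inside the assertion. This is not .NET code. The helper name namePartAt and its zero-based n parameter are hypothetical and only serve to show how the {n} quantifier picks the part.

```javascript
// Hypothetical helper: n = 0 selects the first word after "Name:", n = 1 the
// second, n = 2 the third. Returns null when there is no such part.
function namePartAt(text, n) {
  const pattern = new RegExp(`(?<=\\bName:(?:\\s+\\w+){${n}}\\s+)\\w+`);
  const m = text.match(pattern);
  return m ? m[0] : null;
}

// Input string from the question.
const nameInput = "ID Number / 1234\nName: John Doe Smith\nNationality: US";

const nameParts = [0, 1, 2].map((n) => namePartAt(nameInput, n));
console.log(nameParts); // [ 'John', 'Doe', 'Smith' ]
```

Be aware that \s+ also matches line breaks, so asking for a part beyond the last name would run on into the next line of the input.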
This is the regular expression I was using for this piece of text:
(?![!',:;?\-\d])(\w[A-Za-z']+)
The flavour of regexp is ECMAScript (JavaScript)
The sample text:
This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some. ABBREVIATIONS.Too.
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some. ABBREVIATIONS.Too.
A.S.A.P.?
Ctrl+Alt+Delete
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.
It does everything I want, but I can't work out how to extend the regexp so that it also matches the single letters a b c d e F G H I J ([a-zA-Z] in regexp terms).
I don't want the text such as U.S.A to be matched and this is where I'm having trouble.
I've tried the solution here How to include character in regular expression but I couldn't get that to work due to the more complex nature of my issue.
My mission here is to wrap the matching items with anything.
Here's the link for the same regular expression example:
https://regex101.com/r/Qdq4AY/4
A few notes about the pattern you tried
The pattern (?![!',:;?\-\d])(\w[A-Za-z']+) will not match a single character, because the part \w[A-Za-z']+ matches at least 2 characters due to the + quantifier.
The negative lookahead (?! asserts that what is on the right is not any of [!',:;?\-\d], and then \w matches a word char. Of the characters in that class, \w can only match a digit \d and none of the rest, so the lookahead can be reduced to (?!\d).
One option is to match what you don't want and to capture what you do want to keep:
\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)
In parts
\.? Match an optional dot
[a-zA-Z](?:\.[a-zA-Z])+\.? Match a single char a-zA-Z followed by repeating 1+ times a dot and a single char and an optional dot
| Or
\.[a-zA-Z]\. Match a char a-zA-Z between 2 dots
| or
(?!\d) Assert what is on the right is not a digit
(\w[A-Za-z']*) Capture in group 1 a single word char followed by 0+ repetitions of any char listed in the character class
For example
const regex = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;
const str = `This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some. ABBREVIATIONS.Too.
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some. ABBREVIATIONS.Too.
A.S.A.P.?
Ctrl+Alt+Delete
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
if (undefined !== m[1]) {
console.log(m[1]);
}
}
I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex partly works, but I don't want two constant strings, JOHN and I, to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
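As a quick check, the pattern can be run over the sample text from the question in JavaScript. The lookahead syntax used here is the same across the common regex flavours. The variable names are made up for this sketch.

```javascript
// Negative lookahead right after the leading \b rejects the words JOHN and I.
const excludePattern = /\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b/g;

const excludeText = [
  "JOHN went to LONDON one fine day.",
  "JOHN had lunch in a PUB.",
  "JOHN then moved to CHICAGO.",
  "I don't want JOHN to be highlighted.",
  "John does not want this to match the pattern.",
  "Neither this.",
  "But THIS1 should match the pattern.",
  "Also the other 70 times that the pattern should match.",
].join("\n");

// With the g flag, match() returns all matches (or null if there are none).
const excludeMatches = excludeText.match(excludePattern) || [];

console.log(excludeMatches); // [ 'LONDON', 'PUB', 'CHICAGO', 'THIS1', '70' ]
```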
I have a phrase like this:
Computer, Eddie is gone to the market.
I want to get the word Eddie and ignore all of the other words since other words are constant, and the word Eddie could be anything.
How can I do this with a regular expression?
Edit:
Sorry I'm using .NET regex :)
You can use this pattern:
Computer, (\w+) is gone to the market\.
This uses brackets to match \w+ and captures it in group 1.
Note that the period at the end has been escaped with a \ because . is a regex metacharacter.
Given the input:
LOL! Computer, Eddie is gone to the market. Blah blah
blah. Computer, Alice is gone to the market... perhaps...
Computer, James Bond is gone to the market.
Then there are two matches (as seen on rubular.com). In the first match, group 1 captured Eddie. In the second match, group 1 captured Alice.
Note that \w+ doesn't match James Bond, because \w+ is a sequence of "one or more word characters". If you need to match these kinds of non-"single word" names, then simply replace it with a regex that matches those names.
References
regular-expressions.info/Capturing Groups and The Dot
General technique
Given this test string:
i have 35 dogs, 16 cats and 10 elephants
Then (\d+) (cats|dogs) yields 2 match results (see on rubular.com)
Result 1: 35 dogs
Group 1 captures 35
Group 2 captures dogs
Result 2: 16 cats
Group 1 captures 16
Group 2 captures cats
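The same breakdown can be reproduced in JavaScript, as a rough sketch. The question is about .NET, but the grouping behaviour shown here is the same. The variable names and the shape of the result objects are made up for this example.

```javascript
// Group 1 is the number, group 2 is the animal.
const petPattern = /(\d+) (cats|dogs)/g;
const petText = "i have 35 dogs, 16 cats and 10 elephants";

// matchAll yields one match object per result; index 0 is the whole match,
// indexes 1 and 2 are the capturing groups.
const petResults = Array.from(petText.matchAll(petPattern), (m) => ({
  whole: m[0],
  count: m[1],
  animal: m[2],
}));

console.log(petResults);
// [ { whole: '35 dogs', count: '35', animal: 'dogs' },
//   { whole: '16 cats', count: '16', animal: 'cats' } ]
```

"10 elephants" is skipped because elephants is not one of the alternatives in group 2.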
Related questions
Saving substrings using Regular Expressions
C# snippet
Here's a simple example of capturing groups usage:
var text = @"
LOL! Computer, Eddie is gone to the market. Blah blah
blah. Computer, Alice is gone to the market... perhaps...
Computer, James Bond is gone to the market.
";
Regex r = new Regex(@"Computer, (\w+) is gone to the market\.");
foreach (Match m in r.Matches(text)) {
Console.WriteLine(m.Groups[1]);
}
The above prints (as seen on ideone.com):
Eddie
Alice
API references
System.Text.RegularExpressions Namespace
On specification
As noted, \w+ does not match "James Bond". It does, however, match "o_o", "giggles2000", etc (as seen on rubular.com). As much as reasonably practical, you should try to make your patterns as specific as possible.
Similarly, (\d+) (cats|dogs) will match 100 cats in $100 catsup (as seen on rubular.com).
These are issues on the patterns themselves, and not directly related to capturing groups.
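To make the catsup problem concrete, here is a small JavaScript sketch. The first pattern is the one from above. The second one, with \b word boundaries added, is only one possible tightening that I am assuming for illustration; it is not something the text above prescribes. The variable names are made up.

```javascript
const loosePattern = /(\d+) (cats|dogs)/;
// Assumed tightening: require word boundaries around the whole match.
const stricterPattern = /\b(\d+) (cats|dogs)\b/;

const catsupText = "$100 catsup";

// The loose pattern happily finds "100 cats" inside "$100 catsup".
const looseMatch = catsupText.match(loosePattern);
// The stricter one fails, because "cats" is followed by another word char.
const stricterMatch = catsupText.match(stricterPattern);

console.log(looseMatch ? looseMatch[0] : null);       // 100 cats
console.log(stricterMatch ? stricterMatch[0] : null); // null
```

The stricter pattern still matches ordinary input such as "16 cats and 10 elephants".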
/^Computer, \b(.+)\b is gone to the market\.$/
Eddie would be in the first captured string $1. If you specify the language, we can tell you how to extract it.
Edit: C#:
Match match = Regex.Match(input, @"^Computer, \b(.+)\b is gone to the market\.$");
Console.WriteLine(match.Groups[1].Value);
Get rid of ^ and $ from the regex if the string would be part of another string - they match start and end of a line respectively.