Regular expression - How to include a single character also in this regexp? - regex

This is the regular expression I was using for this piece of text:
(?![!',:;?\-\d])(\w[A-Za-z']+)
The flavour of regexp is ECMAScript (JavaScript)
The sample text:
This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some. ABBREVIATIONS.Too.
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some. ABBREVIATIONS.Too.
A.S.A.P.?
Ctrl+Alt+Delete
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.
It's doing everything I want, but I can't finish the regexp so that it also matches the single letters a b c d e F G H I J, which are [a-zA-Z] in regexp terms.
I don't want the text such as U.S.A to be matched and this is where I'm having trouble.
I've tried the solution here How to include character in regular expression but I couldn't get that to work due to the more complex nature of my issue.
My mission here is to wrap the matching items with anything.
Here's the link for the same regular expression example:
https://regex101.com/r/Qdq4AY/4

A few notes about the pattern you tried
The pattern (?![!',:;?\-\d])(\w[A-Za-z']+) will not match a single character, because the part \w[A-Za-z']+ matches at least 2 characters due to the + quantifier.
The negative lookahead (?! asserts that what is directly to the right is not any of [!',:;?\-\d], and then a word char \w is matched. But of the characters in that class, \w can only match a digit \d and none of the rest, so (?!\d) is sufficient.
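To see both notes in action, here is a quick check one could run in Node or a browser console. The sample string is made up for illustration and is not from the question.

```javascript
// Pattern from the question: \w followed by ONE OR MORE of [A-Za-z'],
// so every match is at least two characters long.
const original = /(?![!',:;?\-\d])(\w[A-Za-z']+)/g;

// Same pattern with the lookahead reduced to the only char \w could match.
const simplified = /(?!\d)(\w[A-Za-z']+)/g;

// Hypothetical sample text with some single letters in it.
const sample = "as a b c d e analyst";

const found = sample.match(original);
const foundSimplified = sample.match(simplified);

console.log(found);           // [ 'as', 'analyst' ]
console.log(foundSimplified); // [ 'as', 'analyst' ]
```

The single letters are skipped by both versions, which is the behaviour described in the question.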
One option is to match what you don't want, and to capture what you want to keep:
\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)
In parts
\.? Match an optional dot
[a-zA-Z](?:\.[a-zA-Z])+\.? Match a single char a-zA-Z followed by repeating 1+ times a dot and a single char and an optional dot
| Or
\.[a-zA-Z]\. Match a char a-zA-Z between 2 dots
| or
(?!\d) Assert what is on the right is not a digit
(\w[A-Za-z']*) Capture in group 1 a single word char, followed by 0+ times any of the chars listed in the character class
Regex demo
For example
const regex = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;
const str = `This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some. ABBREVIATIONS.Too.
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some. ABBREVIATIONS.Too.
A.S.A.P.?
Ctrl+Alt+Delete
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.`;
let m;
while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  if (undefined !== m[1]) {
    console.log(m[1]);
  }
}
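Since the goal in the question is to wrap the matching items, the same pattern could also be used with replace and a callback. This is only a sketch: the function name wrapWords, the square-bracket wrapper and the short sample sentence are assumptions picked for illustration.

```javascript
const pattern = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;

// Hypothetical helper: wrap only what group 1 captured, and put back
// unchanged anything matched by the "unwanted" alternatives.
function wrapWords(text) {
  return text.replace(pattern, (match, word) =>
    word !== undefined ? `[${word}]` : match
  );
}

const wrapped = wrapWords("born in the U.S.A but a b F analyst");
console.log(wrapped);
// [born] [in] [the] U.S.A [but] [a] [b] [F] [analyst]
```

The abbreviation is consumed by the first alternative, so group 1 stays undefined for it and the callback returns it as-is.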

Related

How to convert a PCRE2 regex to JavaScript?

This is the PCRE2 regexp:
(?<=hello )(?:[^_]\w++)++
Its intended use is against strings like the following:
Hello Bob (Marius) Smith. -> Match "Bob"
Hello Bob Jr. (Joseph) White -> Match "Bob Jr."
Hello Bob Jr. IInd (Paul) Jobs -> Match "Bob Jr. IInd"
You get the point.
Essentially there is a magic word, in this case "hello", followed by a first name, followed by a second name which is always between parens.
First names could be anything really. A single word, a list of words followed by punctuation, and so on. Heck, look at Elon Musk's kid's name (X Æ A-Xii) to see how weird names can get :)
Let's only assume ascii, though. Æ is not in my targets :)
I'm at a loss on how to convert this Regexp to JS, and the only viable solution I found was to use PCRE2-wasm on node which spins up a wasm virtual machine and sucks up 1gb of resources just for that. That's insane.
This would match your cases in ECMAScript:
(?<=[Hh]ello )(?:[^_][\w.]+)+
You need to match a capital H as well, by using [Hh] instead of h, because your test cases start with a capital H. Each ++ needs to become a single +, as ECMAScript has no possessive quantifiers.
You also need to allow a . next to the \w, since it occurs in some names.
https://regex101.com/r/lkZK7w/1
-- thanks "D M" for pointing out the missing . in the testcase.
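A small check of the above, assuming an engine with lookbehind support (for example a recent Node or Chrome). The variable names are just for illustration.

```javascript
// The PCRE2 original uses possessive quantifiers (++),
// which the ECMAScript engine rejects when compiling.
let possessiveError = null;
try {
  new RegExp("(?<=hello )(?:[^_]\\w++)++");
} catch (e) {
  possessiveError = e;
}
console.log(possessiveError instanceof SyntaxError); // true

// The converted pattern from this answer.
const converted = /(?<=[Hh]ello )(?:[^_][\w.]+)+/;

const inputs = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// Take the full match, or null when there is none.
const convertedNames = inputs.map((s) => {
  const m = s.match(converted);
  return m ? m[0] : null;
});
console.log(convertedNames); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd' ]
```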
@Nils has the correct answer.
If you do need to expand your acceptable character set, you can use the following regex. Check it out. The g, m, and i flags are set.
(?<=hello ).*(?=\([^\)]*?\))
Hello Bob (Marius) Smith.
Hello Bob Jr. (Joseph) White
Hello Bob Jr. IInd (Paul) Jobs
Hello X Æ A-Xii (Not Elon) Musk
Hello Bob ()) Jr. ( (Darrell) Black
Match Number   Characters   Matched Text
Match 1        6-10         Bob
Match 2        32-40        Bob Jr.
Match 3        61-74        Bob Jr. IInd
Match 4        92-102       X Æ A-Xii
Match 5        124-138      Bob ()) Jr. (
The idea is pretty simple:
Look behind for your keyword: (?<=hello ).
Look ahead for your middle name: (?=\([^\)]*?\)) (anything inside a set of parentheses that is not a closing parenthesis, lazily so you don't take part of the first name).
Take everything between as your first name: .*.
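The three steps above can be tried out like this. It is a rough sketch using matchAll; the trim() call is my own addition, because the match keeps the whitespace in front of the opening parenthesis.

```javascript
// Lookbehind for the keyword, lookahead for the parenthesised name,
// everything in between is the first name. Flags g, m and i as in the answer.
const nameBeforeParens = /(?<=hello ).*(?=\([^\)]*?\))/gim;

const text = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob ()) Jr. ( (Darrell) Black",
].join("\n");

// One match per line; trim the trailing space before "(".
const firstNames = [...text.matchAll(nameBeforeParens)].map((m) => m[0].trim());
console.log(firstNames); // [ 'Bob', 'Bob Jr.', 'Bob ()) Jr. (' ]
```

Because .* is greedy, the match runs up to the last parenthesised group on the line, which is why the odd fifth sample still comes out in one piece.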
The ++ does not work as Javascript does not support possessive quantifiers.
Since the first name is followed by a second name that is always between parens, you might also use a capture group with a match instead of a lookbehind.
\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)
\b[Hh]ello Match hello or Hello
( Capture group 1
\w+.*? Match 1+ word chars followed by any char, as few as possible
) Close group 1
\s*\([^()\s]+\) Match optional whitespace chars, followed by ( and 1+ chars other than parentheses or whitespace till )
Regex demo
const regex = /\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)/;
["Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs"
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log(m[1]);
  }
});
With the lookbehind, you might also match word characters followed by an optionally repeated non-capture group matching whitespace chars followed by word characters or dots.
(?<=[Hh]ello )\w+(?:\s+[\w.]+)*
Regex demo
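For completeness, the lookbehind variant could be exercised the same way as the capture group version. This assumes lookbehind support in the JavaScript engine; here the full match m[0] is the name.

```javascript
const wordsAfterHello = /(?<=[Hh]ello )\w+(?:\s+[\w.]+)*/;

const greetings = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// No capture group is needed, the whole match is the first name.
const lookbehindNames = greetings.map((s) => {
  const m = s.match(wordsAfterHello);
  return m ? m[0] : null;
});
console.log(lookbehindNames); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd' ]
```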

Match pattern multiple lines Integromat

I want to match specific patterns in multiple lines with the match pattern in Integromat. The regex flavor is ECMAScript (JavaScript).
Salutation: Mr. x Mx. Mrs.
or it could look like this:
Salutation: Mr. Mx. x Mrs.
I want to get the string after x\s, to extract either Mr., Mx. or Mrs.
Currently, I am at this point, but it only matches if the x is before Mr.
Salutation:\s(x\s(.*?)[\s])
How do I need to change it? Thanks in advance!
You might use a capture group with an alternation to match either a single occurrence of Mr. Mrs. or Mx.
\bSalutation:.*?\sx\s(M(?:rs?|x)\.)
The pattern matches:
\bSalutation: Match literally
.*?\sx\s Match any char, as few as possible, till the first occurrence of x between whitespace chars
( Capture group 1 (in the example referred to as m[1])
M(?:rs?|x)\. Match M followed by either r with optional s or x and then a dot
) Close group 1
Regex demo
const regex = /\bSalutation:.*?\sx\s(M(?:rs?|x)\.)/;
[
  "Salutation: Mr. x Mx. Mrs.",
  "Salutation: Mr. Mx. x Mrs.",
  "Salutation: Mr. Mx. x Mr.",
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log(m[1]);
  }
});
If you want to match all of the occurrences after the x, and a lookbehind is supported in the Javascript version:
(?<=\bSalutation:.*\sx\s.*)\bM(?:rs?|x)\.
Regex demo
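A short sketch of the lookbehind version with the global flag. Whether Integromat's engine accepts a lookbehind is an assumption here; the snippet only shows what the pattern does in an engine that supports it, such as a recent Node.

```javascript
// Match every salutation that has "Salutation:" and a standalone x
// somewhere before it on the same line.
const afterX = /(?<=\bSalutation:.*\sx\s.*)\bM(?:rs?|x)\./g;

const lines = [
  "Salutation: Mr. x Mx. Mrs.",
  "Salutation: Mr. Mx. x Mrs.",
  "Salutation: Mr. Mx. x Mr.",
];

// match() with the g flag returns all full matches (or null if none).
const salutationsAfterX = lines.map((s) => s.match(afterX) || []);
console.log(salutationsAfterX);
// [ [ 'Mx.', 'Mrs.' ], [ 'Mrs.' ], [ 'Mr.' ] ]
```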

Conditional regex in Ruby

I've got the following string:
'USD 100'
Based on this post I'm trying to capture 100 if USD is contained in the string or the individual (currency) characters if USD is not contained in the string.
For example:
'USD 100' # => '100'
'YEN 300' # => ['Y', 'E', 'N']
So far I've got up to this but it's not working:
https://rubular.com/r/cK8Hn2mzrheHXZ
Interestingly if I place the USD after the amount it seems to work. Ideally I'd like to have the same behaviour regardless of the position of the currency characters.
Your regex (?=.*(USD))(?(1)\d+|[a-zA-Z]) does not work because
(?=.*(USD)) - a positive lookahead, triggered at every location inside a string (if scan is used) that matches USD substring after any 0 or more chars other than line break chars as many as possible (it means, there will only be a match if there is USD somewhere on a line)
(?(1)\d+|[a-zA-Z]) - a conditional construct that matches 1+ digits if Group 1 matched (if there is USD), or, an ASCII letter will be tried. However, the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
Look at the USD 100 regex debugger, it shows exactly what happens when the (?=.*(USD))(?(1)\d+|[a-zA-Z]) regex tries to find a match:
Step 1 to 22: The lookahead pattern is tried first. The point here is that the match will fail immediately if the positive lookahead pattern does not find a match. In this case, USD is found at the start of the string (since the first time the pattern is tried, the regex index is at the string start position). The lookahead found a match.
Step 23-25: since a lookahead is a non-consuming pattern, the regex index is still at the string start position. The lookahead says "go-ahead", and the conditional construct is entered. The (?(1) condition is met: Group 1, USD, was matched. So, the first ("then") part is triggered. \d+ does not find any digits, since there is a U letter at the start. The regex match fails at the string start position, but there are more positions in the string to test, since there is no \A nor ^ anchor that would only allow a match at the start of the string/line.
Step 26: The regex engine index is advanced one char to the right, now, it is right before the letter S.
Step 27-40: The regex engine wants to find 0+ chars and then USD immediately to the right of the current location, but fails (U is already "behind" the index).
Then, the execution is just the same as described above: the regex fails to match USD anywhere to the right of the current location and eventually fails.
If the USD is somewhere to the right of 100, then you'd get a match.
So, the lookahead does not set any search range, it simply allows matching the rest of the patterns (if its pattern matches) or not (if its pattern is not found).
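The non-consuming behaviour is not specific to Ruby and can be reproduced in other engines. Below is a simplified JavaScript sketch: JavaScript has no conditional construct, so the pattern keeps only the lookahead and the \d+ branch. It is an illustration of the principle, not a port of the original regex.

```javascript
// The lookahead checks that USD occurs somewhere to the right,
// but the regex index does not move, so \d+ is tried at the same spot.
const lookaheadThenDigits = /(?=.*USD)\d+/;

// Position 0 sees USD but holds a "U", and no later position
// has USD to its right, so there is no match at all.
const usdFirst = lookaheadThenDigits.exec("USD 100");
console.log(usdFirst); // null

// With USD to the right of the digits, the match succeeds at position 0.
const usdLast = lookaheadThenDigits.exec("100 USD");
console.log(usdLast[0]); // 100
```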
You may use
.scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
Pattern details
^USD.*?\K(\d+) - either USD at the start of the string, then any 0 or more chars other than line break chars as few as possible, and then the text matched is dropped and 1+ digits are captured into Group 1
| - or
([a-zA-Z]) - any ASCII letter captured into Group 2.
See Ruby demo:
p "USD 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["100"]
p "YEN 100".scan(/^USD.*?\K(\d+)|([a-zA-Z])/).flatten.compact
# => ["Y", "E", "N"]
Anatomy of your pattern
(?=.*(USD))(?(1)\d+|[a-zA-Z])
(?= ... ) Positive lookahead
(USD) Capture group 1
(?( ... ) If clause
(1) Test for group 1
\d+ If group 1 exists, match 1+ digits
[a-zA-Z] Else match a single char a-zA-Z
About the pattern you tried
The positive lookahead is not anchored and will be tried on each position. It will continue the match if it returns true, else the match stops and the engine will move to the next position.
Why does the pattern not match?
On the first position the lookahead is true as it can find USD on the right.
It tries to match 1+ digits, but the first char is U which it can not match.
USD 100
⎸
First position
From the second position till the end, the lookahead is false because it can not find USD on the right.
USD 100
 ⎸
Second position
Eventually, the if clause is only tried once, where it could not match 1+ digits. The else clause is never tried and overall there is no match.
For the YEN 300 part, the if clause is never tried as the lookahead will never find USD at the right and overall there is no match.
Interesting resources about conditionals can be for example found at rexegg.com and regular-expressions.info
If you want the separate matches, you might use:
\bUSD \K\d+|[A-Z](?=[A-Z]* \d+\b)
Explanation
\bUSD Assert a word boundary, then match USD and a space
\K\d+ Forget what is matched using \K and match 1+ digits
| Or
[A-Z] Match a char A-Z
(?=[A-Z]* \d+\b) Assert what is on the right is optional chars A-Z and 1+ digits
regex demo
Or using capturing groups:
\bUSD \K(\d+)|([A-Z])(?=[A-Z]* \d+\b)
Regex demo
The following pattern seems to work:
\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b
This works, with the caveat that it has just a single capture group for the non-USD currency symbol. One part of the regex might merit explanation:
(?!USD\b)(\w+)
This uses a negative lookahead to assert that the currency symbol is not USD. If so, then it captures that currency symbol.
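This particular pattern uses no Ruby-only syntax, so it can be tried in JavaScript as well. A minimal sketch, where the helper name extractAmountOrLetters is hypothetical and splitting the symbol into letters is done in code, since the pattern captures the symbol as a whole:

```javascript
const currencyPattern = /\b(?:USD (\d+)|(?!USD\b)(\w+) \d+)\b/;

// Hypothetical helper: return the amount for USD, otherwise the letters
// of the captured currency symbol; null when nothing matches.
function extractAmountOrLetters(str) {
  const m = str.match(currencyPattern);
  if (!m) return null;
  return m[1] !== undefined ? m[1] : m[2].split("");
}

const usdResult = extractAmountOrLetters("USD 100");
const yenResult = extractAmountOrLetters("YEN 300");
console.log(usdResult); // 100
console.log(yenResult); // [ 'Y', 'E', 'N' ]
```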
I suggest the information desired be extracted as follows.
R = /\b([A-Z]{3}) +(\d+)\b/
def doit(str)
  str.scan(R).each_with_object({}) do |(cc,val),h|
    h[cc] = (cc == 'USD') ? val : cc.split('')
  end
end
doit 'USD 100'
#=> {"USD"=>"100"}
doit 'YEN 300'
#=> {"YEN"=>["Y", "E", "N"]}
doit 'I had USD 6000 to spend'
#=> {"USD"=>"6000"}
doit 'I had YEN 25779 to spend'
#=> {"YEN"=>["Y", "E", "N"]}
doit 'I had USD 60 and CDN 80 to spend'
#=> {"USD"=>"60", "CDN"=>["C", "D", "N"]}
doit 'USD -100'
#=> {}
doit 'YENS 4000'
#=> {}
Regex demo
Ruby's regex engine performs the following operations.
\b : assert a word boundary
([A-Z]{3}) : match 3 uppercase letters in capture group 1
\ + : match 1+ spaces
(\d+) : match 1+ digits in capture group 2
\b : assert a word boundary
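For readers outside Ruby, the same idea could be sketched in JavaScript with matchAll. This is my own rough port under the assumption that a plain object is an acceptable stand-in for the hash; the function name extractByCurrency is made up.

```javascript
// Same pattern as R above, with the g flag so matchAll can iterate.
const currencyAndValue = /\b([A-Z]{3}) +(\d+)\b/g;

// Hypothetical port of doit: map each currency code to its value (USD)
// or to its individual letters (any other code).
function extractByCurrency(str) {
  const result = {};
  for (const [, code, value] of str.matchAll(currencyAndValue)) {
    result[code] = code === "USD" ? value : code.split("");
  }
  return result;
}

const mixed = extractByCurrency("I had USD 60 and CDN 80 to spend");
console.log(mixed); // { USD: '60', CDN: [ 'C', 'D', 'N' ] }
console.log(extractByCurrency("YENS 4000")); // {}
```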
TL;DR
An excellent working solution can be found in Wiktor's answer and the rest of the posts.
Long answer:
Since I wasn't perfectly satisfied with Wiktor's explanation of why my solution wasn't working, I decided to dig into it a bit more myself and this is my take on it:
Given the string USD 100, the following regex
(?=.*(USD))(?(1)\d+|[a-zA-Z])
simply won't work. The juice of this whole thing is to figure out why.
It turns out that a lookahead such as (?=.*(USD)) only checks that USD occurs somewhere to the right; it does not move the pointer. Whatever follows it (here the conditional (?(1)\d+|[a-zA-Z])) is therefore still matched from the current position, which in this case yields nothing since the string starts with USD rather than with a digit.
If we break it down in steps, here's an outline of what (I think) is happening:
The pointer is set at the very beginning. The lookahead (?=.*(USD)) is parsed and executed.
USD is found but since the expression is a lookahead the pointer remains at the beginning of the string and is not consumed.
The conditional ((?(1)\d+|[a-zA-Z])) is parsed and executed.
Group 1 is set (since USD has been found); however, \d+ fails because the pointer is still at the beginning of the string, where the next character is U and not a digit. After all, that's exactly why it's called a lookahead: it only looks ahead of the current position and consumes nothing.
Since no digits are found at that position, and USD cannot be found to the right of any later position, the regex returns no results. And as Wiktor correctly pointed out:
the second alternative pattern will never be tried, because you required USD to be present in the string for a match to occur.
which basically says that the conditional is only reached when the lookahead has found USD, so Group 1 is always set at that point and the system never jumps to the "else" branch.
As a counter example, if the same regex is tested on this string, it will work, because USD is to the right of the digits:
'100 USD'
Hope this helps someone in the future.

I am trying to write a regex to capture the name string before the comma excluding the Jr.|Sr. and Roman numbers

The example names that I am trying it on are here
O'Kefe,Shiley
Folenza,Mitchel V
Briscoe Jr.,Sanford Ray
Andade-Alarenga,Blnca
De La Cru,Feando
Carone,Letca Jo
O'Conor,Mole K
Daeron III,Lawence P
Randall,Jason L
Esquel Mendez,Mara D
Dinle III,Jams E
Coras Sr.,Cleybr E
Hsieh-Krnk,Caolyn E
Graves II,Theodore R
I am trying to capture everything before comma except the roman numbers and Sr.|Jr. suffix.
So if the name is like Andade-Alarenga,Blnca I want to capture Andade-Alarenga, but if the name is Briscoe Jr.,Sanford Ray I just want Briscoe.
The code I have tried is here:
^((?:(?![JjSs][rR]\.|\b(?:[IV]+))[^,]))
and also this one:
^(?!\w+ \A[jr|sr|Jr|Sr].*)\w+| \w+ \w+|'\w+|-\w+$
Regex101 with my code and the example sets:
https://regex101.com/r/jX5cK6/2
One option could be using a capturing group with a non greedy match up till the first occurrence of a comma and optionally before the comma match Jr Sr jr sr or a roman numeral.
Then match the comma itself. The value is in capture group 1.
An extended match for a roman numeral can be found for example on this page as the character class [XVICMD]+ is a broad match which would also allow other combinations.
^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,
^ Start of string
( Capture group 1
\w.*? Match a word char and 0+ times any char except a newline non greedy
) close group
(?: Non capturing group
(?: Match a space and start non capturing group
[JjSs]r\. Match any of the listed followed by r.
| Or
[XVICMD]+\b Match 1+ times any of the listed and a word boundary
) Close group
)? Close group and make it optional
, Match the comma
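The question does not name a language, so as an assumption here is how the pattern could be tried in JavaScript on a few of the sample names. The value is read from capture group 1.

```javascript
const surnamePattern = /^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,/;

const sampleNames = [
  "O'Kefe,Shiley",
  "Briscoe Jr.,Sanford Ray",
  "De La Cru,Feando",
  "Daeron III,Lawence P",
  "Hsieh-Krnk,Caolyn E",
];

// Group 1 holds the name without the optional suffix.
const surnames = sampleNames.map((s) => {
  const m = s.match(surnamePattern);
  return m ? m[1] : null;
});
console.log(surnames);
// [ "O'Kefe", 'Briscoe', 'De La Cru', 'Daeron', 'Hsieh-Krnk' ]
```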
Regex demo
Because of your test on Regex101, I'm assuming your regex engine supports positive lookaheads (This is true for PCRE, Javascript or Python, for example)
A positive lookahead will enable you to match only what you want, without the need for capturing groups. The full match will be the string you're looking for.
^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)
The part that matches the name is as simple as it gets:
^[\w'\- ]+?
All it does is match any of the characters in the list. The final ? is there to make it lazy: this way, the engine will only match as few characters as it needs to.
The important part is this one:
(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)
It is divided into two parts by the pipe (this character: |). The first part matches Roman numerals (or nothing), and the second part matches titles (basically, anything that ends in a .). Finally, we need to match the comma, because of your requirement.
Here it is on Regex101
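As a quick illustration in JavaScript (one of the flavours mentioned above), the full match m[0] is already the name, so no capture group is needed. The names are taken from the question's list.

```javascript
const lazyName = /^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)/;

const candidates = [
  "Folenza,Mitchel V",
  "Coras Sr.,Cleybr E",
  "Dinle III,Jams E",
  "Andade-Alarenga,Blnca",
  "De La Cru,Feando",
];

// The lazy quantifier stops at the first position where the lookahead holds.
const lazyMatches = candidates.map((s) => {
  const m = s.match(lazyName);
  return m ? m[0] : null;
});
console.log(lazyMatches);
// [ 'Folenza', 'Coras', 'Dinle', 'Andade-Alarenga', 'De La Cru' ]
```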
You didn't specify a language so I used a regex in the replaceAll() String method of Java.
String[] names = {
    "O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
    "Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
    "O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
    "Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
    "Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"
};
for (String name : names) {
    System.out.println(name + " -> "
            + name.replaceAll("(I{1,3},|((Sr|Jr)\\.,)|,).*", ""));
}
Here is a python solution using re.sub
import re
names = ["O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
"Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
"O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
"Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
"Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"]
for name in names:
    print(name, "->", re.sub(r"(I{1,3},|((Sr|Jr)\.,)|,).*", "", name))
You may use
^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)
See the regex demo
Details
^ - start of a string
(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+ - any char but , ([^,]), one or more occurrences (+), that does not start a Jr. or Sr. char sequence or a whole word consisting of 1 or more X, V, I, C, M,D chars
\b - a word boundary
(?<!\s) - no whitespace immediately to the left is allowed (it is trimming the match)
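A possible way to try this pattern, assuming a JavaScript engine with lookbehind support (the last part of the pattern needs it). Names are from the question's list.

```javascript
// Each consumed char must not start "Jr."/"Sr." or a whole-word Roman numeral;
// the trailing \b(?<!\s) trims the space left in front of a suffix.
const temperedName = /^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)/;

const people = [
  "Briscoe Jr.,Sanford Ray",
  "Graves II,Theodore R",
  "Esquel Mendez,Mara D",
  "Randall,Jason L",
];

const temperedMatches = people.map((s) => {
  const m = s.match(temperedName);
  return m ? m[0] : null;
});
console.log(temperedMatches);
// [ 'Briscoe', 'Graves', 'Esquel Mendez', 'Randall' ]
```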

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted (only LONDON, PUB, CHICAGO, THIS1 and 70 should match):
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output (JOHN and I are highlighted as well):
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex partly works, but I don't want the two constant strings JOHN and I to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed by an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
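The question does not say which language is used, so purely as an assumption, here is a JavaScript check of the pattern against the sample text from the question:

```javascript
const highlight = /\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b/g;

const sampleText = [
  "JOHN went to LONDON one fine day.",
  "JOHN had lunch in a PUB.",
  "JOHN then moved to CHICAGO.",
  "I don't want JOHN to be highlighted.",
  "John does not want this to match the pattern.",
  "Neither this.",
  "But THIS1 should match the pattern.",
  "Also the other 70 times that the pattern should match.",
].join("\n");

// JOHN and I are rejected by the lookahead; John fails at the trailing \b.
const highlighted = sampleText.match(highlight);
console.log(highlighted);
// [ 'LONDON', 'PUB', 'CHICAGO', 'THIS1', '70' ]
```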