I want to match specific patterns in multiple lines with the match pattern in Integromat. The language is ECMAScript (JavaScript) FLAVOR.
Salutation: Mr. x Mx. Mrs.
or it could look like this:
Salutation: Mr. Mx. x Mrs.
I want to get the String after x\s, to extract either Mr. Mx. or Mrs..
Currently, I am at this point, but it only matches if the x is before Mr.
Salutation:\s(x\s(.*?)[\s])
How do I need to change it? Thanks in advance!
You might use a capture group with an alternation to match either a single occurrence of Mr. Mrs. or Mx.
\bSalutation:.*?\sx\s(M(?:rs?|x)\.)
The pattern matches:
\bSalutation: Match literally
.*?\sx\s Match any char, as few as possible, up to the first occurrence of x between whitespace chars
( Capture group 1 (in the example referred to as m[1])
M(?:rs?|x)\. Match M followed by either r with optional s or x and then a dot
) Close group 1
const regex = /\bSalutation:.*?\sx\s(M(?:rs?|x)\.)/;
[
"Salutation: Mr. x Mx. Mrs.",
"Salutation: Mr. Mx. x Mrs.",
"Salutation: Mr. Mx. x Mr.",
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
}
});
If you want to match all of the occurrences after the x, and lookbehind is supported in your JavaScript version:
(?<=\bSalutation:.*\sx\s.*)\bM(?:rs?|x)\.
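A rough sketch of how the lookbehind variant might be used with the global flag, assuming a runtime with variable-length lookbehind support (recent Node.js and Chromium-based engines have it; whether Integromat's engine does is not confirmed here). The helper name titlesAfterX is made up for this illustration.

```javascript
// Sketch: collect every title that appears after the standalone "x".
// Assumes the engine supports variable-length lookbehind.
const afterX = /(?<=\bSalutation:.*\sx\s.*)\bM(?:rs?|x)\./g;

// Hypothetical helper name, used only for this illustration.
function titlesAfterX(line) {
  // matchAll needs the g flag and yields one match object per title.
  return Array.from(line.matchAll(afterX), m => m[0]);
}

console.log(titlesAfterX("Salutation: Mr. x Mx. Mrs.")); // [ 'Mx.', 'Mrs.' ]
console.log(titlesAfterX("Salutation: Mr. Mx. x Mrs.")); // [ 'Mrs.' ]
```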
I need to match a specific pattern but I'm unable to do it with regular expressions. I'm looking for people's name. It follows always the same patterns. Some combinations are:
Mr. Snow
Mr. John Snow
Mr. John Snow (Winterfall of the nord lands)
My problem comes when sometimes I have things like: Mr. Snow and Ms. Stark. It captures also the and. So I'm looking for a regular expression that does not capture the second name only if it is and. Here I'm looking for ["Mr. Snow", "Ms. Stark"].
My best try is as follows:
(M[rs].\s\w+(?:\s[\w-]+)(?:\s\([^\)]*\))?).
Note that the second name is in a non-capturing group because I was thinking of using a negative look-ahead. But if I do that, the first word is not captured (because the entire name does not match), and I need that to be captured.
Any Ideas?
Here is some text to fast check.
Here is my two cents:
\bM[rs]\.\h(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*)\b
\b - A word-boundary;
M[rs]\.\h - Match Mr. or Ms. followed by a horizontal whitespace;
(\p{Lu}\p{Ll}+(?:[\h-]\p{Lu}\p{Ll}+)*) - A capture group with a nested non-capture group to match an uppercase letter followed by lowercase letters and 0+ 2nd names concatenated through whitespace or hyphen;
\b - A word-boundary.
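The pattern above relies on \h, which JavaScript does not have. A possible ECMAScript adaptation, assuming the u flag is available for \p{...} and treating "horizontal whitespace" as just a space or a tab (PCRE's \h covers more characters), might look like this. The function name findNames is hypothetical.

```javascript
// Sketch: ECMAScript adaptation of the PCRE pattern.
// Assumption: [ \t] stands in for \h, which JavaScript lacks.
const nameRe = /\bM[rs]\.[ \t](\p{Lu}\p{Ll}+(?:[ \t-]\p{Lu}\p{Ll}+)*)\b/gu;

// Hypothetical helper: returns the full matches, title included.
function findNames(text) {
  return Array.from(text.matchAll(nameRe), m => m[0]);
}

console.log(findNames("Mr. Snow and Ms. Stark")); // [ 'Mr. Snow', 'Ms. Stark' ]
console.log(findNames("Mr. John Snow (Winterfall of the nord lands)")); // [ 'Mr. John Snow' ]
```

Use m[1] instead of m[0] if only the name without the title is needed.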
As it is the name of a person, you could also check that each word starts with an uppercase letter (note the escaped dot after the title).
M[rs]\.\s[A-Z]\w+(?:\s[A-Z]\w+(?:\s\([^\)]*\))?)?
Matching names is difficult, see this page for a nice article:
Falsehoods Programmers Believe About Names.
For the examples that you have given, you might use:
\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*
Explanation
\b A word boundary
M[rs]\. Match either Mr or Ms followed by a dot (note to escape it)
(?: Non capture group
  Match a space (or \s+ if you also want to allow newlines)
(?!M[rs]\.|and ) Negative lookahead, assert that from the current position there is not Mr or Ms or and directly to the right
\w+ Match 1+ word characters
)* Close the non capture group and optionally repeat it
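As a quick check of the pattern explained above, here is a small sketch that runs it against the sample lines from the question. The variable names are only for illustration.

```javascript
// Sketch: the lookahead stops the match before "and " or before another title.
const titleAndName = /\bM[rs]\.(?: (?!M[rs]\.|and )\w+)*/g;

const samples = [
  "Mr. Snow",
  "Mr. John Snow (Winterfall of the nord lands)",
  "Mr. Snow and Ms. Stark",
];

// With the g flag, String.prototype.match returns all full matches per line.
const found = samples.map(s => s.match(titleAndName));
console.log(found);
// [ [ 'Mr. Snow' ], [ 'Mr. John Snow' ], [ 'Mr. Snow', 'Ms. Stark' ] ]
```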
This captures the first name in group 1, and the second in group 2 if the second name exists and is not and:
(?<=M[rs]\. )(\w+)(?: (?!and)(\w+))?
If you want to capture the title as group 1 and the names as groups 2 and 3, change the look behind to a capture group:
(M[rs]\.) (\w+)(?: (?!and)(\w+))?
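A minimal sketch of how the three groups could be read in JavaScript. The helper name splitNames is hypothetical, and a missing second name is returned as null here purely as a convention for this example.

```javascript
// Sketch: title in group 1, first name in group 2, optional second name in group 3.
const parts = /(M[rs]\.) (\w+)(?: (?!and)(\w+))?/g;

// Hypothetical helper: one [title, first, second] triple per match.
function splitNames(text) {
  return Array.from(text.matchAll(parts), m => [
    m[1],
    m[2],
    m[3] === undefined ? null : m[3], // group 3 did not participate
  ]);
}

console.log(splitNames("Mr. Snow and Ms. Stark"));
// [ [ 'Mr.', 'Snow', null ], [ 'Ms.', 'Stark', null ] ]
console.log(splitNames("Mr. John Snow"));
// [ [ 'Mr.', 'John', 'Snow' ] ]
```

Note that (?!and) also rejects any second word that merely starts with a lowercase "and"; (?!and\b) would be a stricter check if that matters for your data.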
This is the PCRE2 regexp:
(?<=hello )(?:[^_]\w++)++
Its intended use is against strings like the following:
Hello Bob (Marius) Smith. -> Match "Bob"
Hello Bob Jr. (Joseph) White -> Match "Bob Jr."
Hello Bob Jr. IInd (Paul) Jobs -> Match "Bob Jr. IInd"
You get the point.
Essentially there is a magic word, in this case "hello", followed by a first name, followed by a second name which is always between parens.
First names could be anything really. A single word, a list of words followed by punctuation, and so on. Heck, look at Elon Musk's kid's name (X Æ A-Xii) to see how weird names can get :)
Let's only assume ascii, though. Æ is not in my targets :)
I'm at a loss on how to convert this Regexp to JS, and the only viable solution I found was to use PCRE2-wasm on node which spins up a wasm virtual machine and sucks up 1gb of resources just for that. That's insane.
This would match your cases in ECMAScript.
(?<=[Hh]ello )(?:[^_][\w.]+)+
Your test cases start with a capital H, so you need to match [Hh] instead of h. ECMAScript does not support possessive quantifiers, so each ++ has to become a single +.
You also need to allow a . alongside \w, since some of the names contain one.
https://regex101.com/r/lkZK7w/1
-- thanks "D M" for pointing out the missing . in the testcase.
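For a quick local check without regex101, the pattern above could be run like this. It is only a sketch and assumes a runtime with lookbehind support; the variable names are made up.

```javascript
// Sketch: first-name extraction with a lookbehind and plain (non-possessive) quantifiers.
const firstName = /(?<=[Hh]ello )(?:[^_][\w.]+)+/;

const greetings = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// Take the full match of each line, or null when nothing matches.
const firstNames = greetings.map(s => {
  const m = s.match(firstName);
  return m ? m[0] : null;
});
console.log(firstNames); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd' ]
```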
@Nils has the correct answer.
If you do need to expand your acceptable character set, you can use the following regex. Check it out. The g, m, and i flags are set.
(?<=hello ).*(?=\([^\)]*?\))
Hello Bob (Marius) Smith.
Hello Bob Jr. (Joseph) White
Hello Bob Jr. IInd (Paul) Jobs
Hello X Æ A-Xii (Not Elon) Musk
Hello Bob ()) Jr. ( (Darrell) Black
Match Number   Characters   Matched Text
Match 1        6-10         Bob
Match 2        32-40        Bob Jr.
Match 3        61-74        Bob Jr. IInd
Match 4        92-102       X Æ A-Xii
Match 5        124-138      Bob ()) Jr. (
The idea is pretty simple:
Look behind for your keyword: (?<=hello ).
Look ahead for your middle name: (?=\([^\)]*?\)) (anything inside a set of parentheses that is not a closing parenthesis, lazily so you don't take part of the first name).
Take everything between as your first name: .*.
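The three steps above can be sketched in JavaScript as follows, assuming lookbehind support in the runtime. The variable names are hypothetical, and the .trim() call is an addition of this sketch, since the greedy .* keeps the space before the opening parenthesis.

```javascript
// Sketch: everything between "hello " and the parenthesised name, line by line.
const between = /(?<=hello ).*(?=\([^\)]*?\))/gim;

const sampleText = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
  "Hello X Æ A-Xii (Not Elon) Musk",
].join("\n");

// "." does not cross newlines, so each line yields at most one match.
const extracted = Array.from(sampleText.matchAll(between), m => m[0].trim());
console.log(extracted); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd', 'X Æ A-Xii' ]
```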
The ++ does not work because JavaScript does not support possessive quantifiers.
Since the first name is followed by a second name that is always between parens, you might also use a capture group in a regular match instead of a lookbehind.
\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)
\b[Hh]ello Match hello or Hello
( Capture group 1
\w+.*? Match 1+ word chars followed by any chars, as few as possible
) Close group 1
\s*\([^()\s]+\) Match optional whitespace chars, then ( followed by 1+ chars other than parentheses or whitespace, then )
const regex = /\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)/;
["Hello Bob (Marius) Smith.",
"Hello Bob Jr. (Joseph) White",
"Hello Bob Jr. IInd (Paul) Jobs"
].forEach(s => {
const m = s.match(regex);
if (m) {
console.log(m[1]);
}
})
With the lookbehind, you might also match word characters followed by an optionally repeated non-capture group matching whitespace chars followed by word characters or dots.
(?<=[Hh]ello )\w+(?:\s+[\w.]+)*
If I have a value like
email = "Mark Johnson (mark#johnson.com)"
how can I print only the text inside the () so that the output becomes
mark#johnson.com
Currently I am using regex, but it doesn't seem to work with my pattern:
email = regex("x{(,)}", "var.email")
How can I solve this issue? Thanks for your help.
Terraform uses the RE2 regex engine, and the regex function returns only the captured value if you define a capturing group in the regex pattern. It will return a list of captures if you have more than one capturing group in your pattern, but here, you need just one.
To extract all text inside parentheses:
> regex("[(]([^()]+)[)]", "Mark Johnson (mark#johnson.com)")
The [(] matches a ( char, ([^()]+) captures into Group 1 one or more chars other than ( and ), and [)] matches a ) char.
To extract an email-like string from parentheses:
> regex("[(]([^()#[:space:]]+#[^()[:space:]]+[.][^()[:space:]]+)[)]", "Mark Johnson (mark#johnson.com)")
Here, [^()#[:space:]]+ matches 1 or more chars other than (, ), # and whitespace.
This is the regular expression I was using for this piece of text:
(?![!',:;?\-\d])(\w[A-Za-z']+)
The flavour of regexp is ECMAScript (JavaScript)
The sample text:
This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some. ABBREVIATIONS.Too.
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some. ABBREVIATIONS.Too.
A.S.A.P.?
Ctrl+Alt+Delete
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.
It does everything I want, except that I can't finish the regexp so that it also matches the single letters a b c d e F G H I J, which are [a-zA-Z] in regexp terms.
I don't want the text such as U.S.A to be matched and this is where I'm having trouble.
I've tried the solution here How to include character in regular expression but I couldn't get that to work due to the more complex nature of my issue.
My mission here is to wrap the matching items with anything.
Here's the link for the same regular expression example:
https://regex101.com/r/Qdq4AY/4
A few notes about the pattern you tried
The pattern (?![!',:;?\-\d])(\w[A-Za-z']+) will not match a single character because this part \w[A-Za-z']+ matches at least 2 characters due to the + quantifier
The negative lookahead (?! asserts that what is on the right is not any of [!',:;?\-\d], and then a word char \w is matched. Of those excluded characters, \w can only match a digit \d, so the rest of the class has no effect.
One option is to match what you don't want, and to capture what you want to keep:
\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)
In parts
\.? Match an optional dot
[a-zA-Z](?:\.[a-zA-Z])+\.? Match a single char a-zA-Z followed by repeating 1+ times a dot and a single char and an optional dot
| Or
\.[a-zA-Z]\. Match a char a-zA-Z between 2 dots
| or
(?!\d) Assert what is on the right is not a digit
(\w[A-Za-z']*) Capture in group 1 a single word char followed by 0+ of the chars listed in the character class
For example
const regex = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;
const str = `This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some. ABBREVIATIONS.Too.
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some. ABBREVIATIONS.Too.
A.S.A.P.?
Ctrl+Alt+Delete
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
if (undefined !== m[1]) {
console.log(m[1]);
}
}
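Since the goal in the question is to wrap the matching items, here is a hedged sketch of how that might be done with a replace callback. The helper name wrapWords and the bracket markers are assumptions made for this example, not part of the question.

```javascript
// Sketch: wrap only the captured words; acronym-like runs are matched but left untouched.
const wordOrAcronym = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;

// Hypothetical helper: wraps each group-1 match in the given markers.
function wrapWords(input, open = "[", close = "]") {
  return input.replace(wordOrAcronym, (match, word) =>
    // When group 1 did not participate, return the match unchanged.
    word === undefined ? match : open + word + close
  );
}

console.log(wrapWords("born in the U.S.A but a b c"));
// [born] [in] [the] U.S.A [but] [a] [b] [c]
```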
I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? should I use again preg_match and use offset = 1 in the arguments? but that offset is in characters or bytes right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could maintain the whole first two words when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice workaround for the restriction that you cannot use a variable-length lookbehind in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches every other word and replaces it with its first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) Captures all the uppercase letters which are preceded by a horizontal space character.
\w+ matches one or more word characters.
Replacing the matched chars with the contents of group 1 (\1) plus a dot gives you the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
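The answers here are in PHP, but the look-ahead plus word-boundary pattern uses nothing PCRE-specific, so for comparison a JavaScript port might look like the sketch below. The function name abbreviate is made up for the example.

```javascript
// Sketch: JavaScript port of the look-ahead + word-boundary approach.
// Hypothetical helper name.
function abbreviate(name) {
  // (?!^)\b skips the first word; (\w)\w+ keeps the initial of every later word.
  return name.replace(/(?!^)\b(\w)\w+/g, "$1.");
}

console.log(abbreviate("John Smith"));     // John S.
console.log(abbreviate("David X. Cohen")); // David X. C.
console.log(abbreviate("Kim Jong Un"));    // Kim J. U.
console.log(abbreviate("Bob"));            // Bob
```

An existing initial such as "X." is left alone because \w+ needs at least one more word character after the captured letter.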