I want to capture a word placed before another one which is fully capitalized
Mister Foo BAR is here # => "Foo"
Miss Bar-Barz FOO loves cats # => "Bar-Barz"
I've been trying the following regex: (Mister|Miss)\s([[:alpha:]\s\-]+)(?=\s[A-Z]+), but sometimes it includes the rest of the sentence. For example, it returns Bar-Barz FOO loves cats instead of Bar-Barz.
How can I say, using RegExp, "match every word until the uppercase word"?
To clarify the usage of positive lookahead, can we say it "captures until the specified sub-pattern matches, but does not include it in the match data"?
As a non-native English speaker, apologies if my question isn't perfectly formulated. Thanks in advance
Match 1+ word chars, optionally followed by repeated groups of a - and 1+ word chars, so that the match never consists only of hyphens or ends in a hyphen.
Then assert that what follows is a whitespace char, 1+ uppercase chars, and a word boundary.
\w+(?:-\w+)*(?=\s[A-Z]+\b)
Explanation
\w+ Match 1+ word char
(?:-\w+)* Optionally repeat matching - and 1+ word chars
(?=\s[A-Z]+\b) Positive lookahead, assert what is directly to the right is a whitespace char, then 1+ uppercase chars A-Z followed by a word boundary
If the words must not be separated by a newline, you can use [^\S\r\n] instead of \s
\w+(?:-\w+)*(?=[^\S\r\n]+[A-Z]+\b)
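As a quick check, both patterns can be run against the sample sentences from the question. This is only a minimal JavaScript sketch: the helper name wordBeforeCaps and the input containing a line break are illustrative assumptions, not part of the original answer.

```javascript
// Pattern from the answer: a (possibly hyphenated) word followed by an all-caps word.
const anyWhitespace = /\w+(?:-\w+)*(?=\s[A-Z]+\b)/;
// Variant whose gap before the all-caps word may not contain a line break.
const sameLineOnly = /\w+(?:-\w+)*(?=[^\S\r\n]+[A-Z]+\b)/;

// Hypothetical helper: return the first match, or null if there is none.
function wordBeforeCaps(pattern, text) {
  const m = text.match(pattern);
  return m ? m[0] : null;
}

console.log(wordBeforeCaps(anyWhitespace, "Mister Foo BAR is here"));       // Foo
console.log(wordBeforeCaps(anyWhitespace, "Miss Bar-Barz FOO loves cats")); // Bar-Barz
console.log(wordBeforeCaps(anyWhitespace, "Mister Foo\nBAR is here"));      // Foo
console.log(wordBeforeCaps(sameLineOnly, "Mister Foo\nBAR is here"));       // null
```

The last two calls show the only difference between the variants: \s accepts the newline between Foo and BAR, while [^\S\r\n] does not.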
I want to capture a word placed before another one which is fully capitalized
You may use this regex with a lookahead:
\b\S+(?=[ \t]+[A-Z]+\b)
RegEx Description:
\b: Word boundary
\S+: Match 1+ non-whitespace characters
(?=[ \t]+[A-Z]+\b): Positive lookahead that asserts we have 1+ spaces or tabs and then a word containing only capital letters
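For illustration, here is a small JavaScript sketch of this pattern on the question's samples. The third input (with an apostrophe) and the helper name firstToken are my own assumptions, added to show that \S+ keeps any non-whitespace characters inside the word.

```javascript
// Pattern from the answer: a run of non-whitespace chars before an all-caps word.
const tokenBeforeCaps = /\b\S+(?=[ \t]+[A-Z]+\b)/;

// Hypothetical helper: first match or null.
const firstToken = (text) => {
  const m = text.match(tokenBeforeCaps);
  return m ? m[0] : null;
};

console.log(firstToken("Mister Foo BAR is here"));       // Foo
console.log(firstToken("Miss Bar-Barz FOO loves cats")); // Bar-Barz
console.log(firstToken("Miss d'Arc FOO loves cats"));    // d'Arc
```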
You don't say what language you're working in, but the following works for me. The idea is to stop when the parser hits a sequence of uppercase letters/hyphens.
JS example:
let ptn = /(Mister|Miss)\s[\w\-]+(?=\s[A-Z\-]+\b)/;
"Mister Foo BAR is here".match(ptn); //["Mister Foo", "Mister"]
"Miss Bar-Barz FOO loves cats".match(ptn); //["Miss Bar-Barz", "Miss"]
Related
I am trying to solve http://play.inginf.units.it/#/level/10
I have some strings as follows:
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},
I need to match the names in bold. I tried the following regex:
(?<=author={).+(?=})
But it matches the entire string inside {}. I understand why that is, but how can I split the match at each "and"?
It took me a little while to get the samples to show up in your link. What about:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+
See an online demo
(?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and the literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match ' and ';
\K - Reset starting point of reported match;
(?:(?! and |},).)+ - Match 1+ characters, each only if ' and ' or '},' does not start at that position.
The above will also match 'others', as in the last sample of the linked test. If you wish to exclude 'others', then maybe add that option to the negated list as per:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+
See an online demo
In the comments we established that the above does not work on the linked website. Apparently it is JS-based, which does support zero-width lookbehind. Therefore, try:
(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*
See the demo
(?<= - Open lookbehind;
\bauthor={ - Match word-boundary and literally 'author={';
(?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
\b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
(?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching comma/space separated parts of a name.
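To see what this lookbehind version returns, here is a small sketch in JavaScript run against two of the question's samples. The literal braces are escaped here, which should be equivalent in JS; the variable names are made up for illustration.

```javascript
// Pattern from the answer, with the literal braces escaped.
const authorPattern = /(?<=\bauthor=\{(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*/g;

// Sample lines taken from the question.
const bibtex = [
  "title={AUTOMATIC ROCKING DEVICE},",
  "author={Diaz, Navarro David and Gines, Rodriguez Noe},",
  "year={2006},",
  "author={Standefer, Michael and Bay, Janet W and Trusso, Russell},",
].join("\n");

// With the g flag, match() returns every full match as an array of strings.
const authors = bibtex.match(authorPattern);
console.log(authors);
// → ["Diaz, Navarro David", "Gines, Rodriguez Noe",
//    "Standefer, Michael", "Bay, Janet W", "Trusso, Russell"]
```

The title line is skipped because the lookbehind requires author={ earlier on the same line (the dot does not cross line breaks).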
If a lookbehind assertion is supported and you are matching word characters, you might use:
(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\bauthor={ Match author={ preceded by a word boundary
[^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
) Close the lookbehind
[A-Z] Match an uppercase char A-Z
[^\s,]*, Optionally match non whitespace chars except , and then match ,
(?: Non capture group to repeat as a whole part
\s+[A-Z][^\s,]* Match 1+ whitespace chars, uppercase char A-Z, optional non whitespace chars except ,
)+ Close the non capture group and repeat it 1 or more times
\b a word boundary
See a regex101 demo.
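A short JavaScript sketch of this pattern on the question's third sample, which contains a nested {...} inside the author field. The braces are escaped and the variable names are illustrative assumptions.

```javascript
// Pattern from the answer; the lookbehind allows nested {...} groups
// between "author={" and the start of a name.
const namePattern =
  /(?<=\bauthor=\{[^{}]*(?:\{[^{}]*\}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b/g;

// Sample line taken from the question.
const authorLine = 'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},';

const names = authorLine.match(namePattern);
console.log(names);
// → ["Kordesch, Karl", 'Simader, G{"u}nter', "Wiley, John"]
```

The trailing \b is what drops the closing } after John: the match backtracks to the last word character.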
I've been trying to solve this problem for a few hours but with no luck. The task is to write a regular expression that matches at least four words starting with the same letter. But! These words do not have to be one after another.
This regex should be able to match a line like this:
cat color coral chat
but also one like this:
cat take boom candle creepy drum cheek
Thank you!
So far I have got this regex but it only matches words when they are in order.
(\w)\w+\s+\1\w+\s+\1\w+\s+\1
If you have only words in the line that can be matched with \w:
\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}
Explanation
\b A word boundary to prevent a partial word match
(\w)\w* Capture a single word character in group 1 followed by matching optional word characters
(?: Non capture group to repeat as a whole part
(?:\s+\w+)*? Lazily repeat 1+ whitespace chars and 1+ word chars, to skip words in between that do not start with the character captured in group 1
\s+\1\w* Match 1+ whitespace chars, a backreference to the same captured character and optional word characters
){3} Close the non capture group and repeat 3 times
See a regex demo
Note that \s can also match a newline.
If the words that start with the same character should be at least 2 characters long (as (\w)\w+ matches 2 or more characters):
\b(\w)\w+(?:(?:\s+\w+)*?\s+\1\w+){3}
See another regex demo.
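The first pattern can be tried on the two lines from the question. In this minimal JavaScript sketch, the third input is a made-up counterexample with only two words starting with c.

```javascript
// First pattern from the answer: one word, then three more words that start
// with the same character, with any number of other words in between.
const fourSameInitial = /\b(\w)\w*(?:(?:\s+\w+)*?\s+\1\w*){3}/;

console.log(fourSameInitial.test("cat color coral chat"));                   // true
console.log(fourSameInitial.test("cat take boom candle creepy drum cheek")); // true
console.log(fourSameInitial.test("cat take boom candle drum"));              // false
```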
Another idea to match lines with at least 4 words starting with the same letter:
\b(\w)(?:.*?\b\1){3}
See this demo at regex101
This is not very accurate: it only checks that, to the right of the character captured by \b(\w), there are three more \b word boundaries each followed by \1, with .*? matching any characters in between.
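One way to see the looseness is a hyphenated input. In this JavaScript sketch, the hyphenated string and the counterexample are made-up inputs: since a hyphen also creates a word boundary, parts of a single hyphenated token count as separate words.

```javascript
// Pattern from the answer: a word-initial character, then three more
// word boundaries followed by that same character, anything in between.
const loosePattern = /\b(\w)(?:.*?\b\1){3}/;

console.log(loosePattern.test("cat take boom candle creepy drum cheek")); // true
console.log(loosePattern.test("co-co-ca-cola"));                          // true
console.log(loosePattern.test("cat take boom candle drum"));              // false
```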
aws-sdk-java/1.9.4 Linux/3.10.0-862.mt20190308.130.el7.x86_64 Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02/1.8.0_45
I want to get the substring 'aws-sdk-java/1.9.4'
Here is my regular expression:
(\S+?\/\S+?)(\s|$)
but it matches multiple times.
Can someone help me? Thank you very much~
You could make the pattern a bit more specific, and get the match directly without capture groups.
(?<!\S)\w+(?:-\w+)*\/\d+(?:\.\w+)*(?!\S)
Explanation
(?<!\S) Assert a whitespace boundary to the left
\w+(?:-\w+)* Match 1+ word chars and optionally repeat - and 1+ word chars
\/ Match / (Depending on the delimiter of the pattern, you don't have to escape the /)
\d+(?:\.\w+)* Match 1+ digits and optionally repeat . and 1+ word characters
(?!\S) Assert a whitespace boundary to the right
Or a broader variant:
(?<!\S)[^\/\s]+\/\w+(?:\.\w+)*(?!\S)
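Both variants can be checked against the string from the question. This is a minimal JavaScript sketch; the variable names are illustrative.

```javascript
// The user-agent string from the question.
const userAgent =
  "aws-sdk-java/1.9.4 Linux/3.10.0-862.mt20190308.130.el7.x86_64 " +
  "Java_HotSpot(TM)_64-Bit_Server_VM/25.45-b02/1.8.0_45";

// Specific variant: hyphenated word chars, a slash, then a dotted version.
const strictToken = /(?<!\S)\w+(?:-\w+)*\/\d+(?:\.\w+)*(?!\S)/g;
// Broader variant: anything except slash/whitespace before the slash.
const broaderToken = /(?<!\S)[^\/\s]+\/\w+(?:\.\w+)*(?!\S)/g;

console.log(userAgent.match(strictToken));  // → ["aws-sdk-java/1.9.4"]
console.log(userAgent.match(broaderToken)); // → ["aws-sdk-java/1.9.4"]
```

The other two tokens are rejected because their version part continues with a - (for example 3.10.0-862), so the (?!\S) assertion fails.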
I need to match all uppercase words that don't start with a hyphen.
There are multiple uppercase words in each line.
examples:
,BOAT -> match
BANANA, -> match
WATER -> match
-ER -> no match because of hyphen
Thanks in advance :)
I need to match all uppercase words that don't start with a hyphen.
You may use this regex:
(?<!\S)[^-A-Z\s]*[A-Z]+
RegEx Explained:
(?<!\S): Make sure we don't have a non-space before current position
[^-A-Z\s]*: Match 0 or more of any characters that are not hyphen and not uppercase letters and not whitespaces
[A-Z]+: Match 1+ uppercase letters
You can use
\b(?<!-)[A-Z]+\b
\b(?<!-)\p{Lu}+\b
See the regex demo
Details:
\b - word boundary
(?<!-) - a negative lookbehind that fails the match if there is a - immediately to the left of the current position
[A-Z]+ / \p{Lu}+ - one or more uppercase letters (\p{Lu} matches any uppercase Unicode letters)
\b - word boundary.
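To compare this with the (?<!\S) pattern from the other answer, here is a small JavaScript sketch. It puts the question's four examples on one line, which is my own assumption about the input layout.

```javascript
// The four examples from the question, joined into a single line.
const sampleLine = ",BOAT BANANA, WATER -ER";

// Pattern from the other answer: starts at a whitespace boundary.
const fromWhitespace = /(?<!\S)[^-A-Z\s]*[A-Z]+/g;
// Pattern from this answer: word boundary not preceded by a hyphen.
const fromBoundary = /\b(?<!-)[A-Z]+\b/g;

console.log(sampleLine.match(fromWhitespace)); // → [",BOAT", "BANANA", "WATER"]
console.log(sampleLine.match(fromBoundary));   // → ["BOAT", "BANANA", "WATER"]
```

Both skip -ER, but the whitespace-based pattern keeps leading punctuation such as the comma in the match, so it may need a capture group or trimming if only the letters are wanted.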
I need a regex to select a word whose characters are each separated by whitespace. Look at the following string:
Mengkapan,Sungai Apit,S I A K,Riau,
I want to select S I A K. I am stuck, I was trying to use the following regex
\s+\w{1}\s+
but it's not working.
I suggest
\b[A-Za-z](?:\s+[A-Za-z])+\b
pattern, where
\b - word boundary
[A-Za-z] - letter (exactly one)
(?: - one or more groups of
\s+ - white space (at least one)
[A-Za-z] - letter (exactly one)
)+
\b - word boundary
For the input you have given, you could use
(?:[A-Za-z] ){2,}[A-Za-z]
See a demo on regex101.com.
You could match a word boundary \b and a word character \w, then repeat a space and a word character at least 2 times, followed by a word boundary:
\b\w(?: \w){2,}\b
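The suggestions differ mainly in how many letters they require. As a rough JavaScript sketch, the second input below is a made-up string with only two spaced letters, to show the effect of {2,} versus +.

```javascript
// The string from the question.
const address = "Mengkapan,Sungai Apit,S I A K,Riau,";

// At least two single letters separated by whitespace.
const twoOrMoreLetters = /\b[A-Za-z](?:\s+[A-Za-z])+\b/;
// At least three single word chars separated by one space each.
const threeOrMoreChars = /\b\w(?: \w){2,}\b/;

console.log(address.match(twoOrMoreLetters)[0]); // S I A K
console.log(address.match(threeOrMoreChars)[0]); // S I A K

// Hypothetical input with only two spaced letters.
console.log("Plan A B done".match(twoOrMoreLetters)[0]); // A B
console.log("Plan A B done".match(threeOrMoreChars));    // null
```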