In the app I use, I cannot select match group 1.
The only result I can use is the full match of a regex,
but I need the 5th word "jumps" as the match result, not the complete match "The quick brown fox jumps".
^(?:[^ ]*\ ){4}([^ ]*)
The quick brown fox jumps over the lazy dog
Here is a link https://regex101.com/r/nB9yD9/6
Since you need the entire match to be only the n-th word, you can try a positive lookbehind, which lets you match something only if it is preceded by something else.
To match only the fifth word, you want to match the first word that has four words before it.
To match four words (i.e. word characters followed by a space character):
(\w+\s){4}
To match a single word, but only if it was preceded by four other words:
(?<=(\w+\s){4})(\w+)
Test the result here https://regex101.com/r/QIPEkm/1
To find the 3rd word of a sentence, use:
^(?:\w+ ){2}\K\w+
Explanation:
^ # beginning of line
(?: # start non capture group
\w+ # 1 or more word character
# a space
){2} # group must appear twice (change {2} to {3} to get the 4th word and so on)
\K # forget all we have seen until this position
\w+ # 1 or more word character
Demo
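Python's built-in re module does not support \K, so the pattern above cannot be used there unchanged. Here is a minimal sketch of the same idea, assuming a capture group is acceptable in that setting (the helper name nth_word is hypothetical, not something from the answer):

```python
import re

def nth_word(text, n):
    """Return the n-th word (1-based) of text, or None if there are fewer words."""
    # Skip n-1 words, each followed by a single space, then capture the next word.
    # The capture group stands in for \K, which Python's re module lacks.
    pattern = r"^(?:\w+ ){%d}(\w+)" % (n - 1)
    m = re.search(pattern, text)
    return m.group(1) if m else None

sentence = "The quick brown fox jumps over the lazy dog"
print(nth_word(sentence, 3))  # brown
print(nth_word(sentence, 5))  # jumps
```

In an engine with \K support (PCRE, for example) the group is unnecessary, because the full match is already the word itself.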
It works https://regex101.com/r/pR22LK/2 with PCRE. Your app doesn't seem to support it, but I don't know how it works. I think you have to extract all the words in an array then select the ones you want. – Toto 23 hours ago
Hello Toto, your solution works in the app too, like PCRE, thanks! – gsxr1300 just now
To match "the first" four words (i.e. word characters followed by a space character):
^(\w+\s){4}
To match a single word, but only if it was preceded by "the first" four other words:
(?<=^(\w+\s){4})(\w+)
Note the ^ difference
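To see what the ^ changes, here is a rough Python sketch. Python's re module only accepts fixed-width lookbehind, so both lookbehind patterns would raise an error there; the sketch instead emulates them by testing the text in front of each word. The names UNANCHORED, ANCHORED and words_preceded_by_four are made up for the illustration.

```python
import re

# Four words, each followed by one whitespace character, ending exactly where
# the candidate word starts (\Z = end of the prefix).
UNANCHORED = re.compile(r"(?:\w+\s){4}\Z")
# Same, but the four words must also begin at the very start of the text.
ANCHORED = re.compile(r"\A(?:\w+\s){4}\Z")

def words_preceded_by_four(text, prefix_pattern):
    """Return every word whose preceding text satisfies prefix_pattern."""
    return [m.group() for m in re.finditer(r"\w+", text)
            if prefix_pattern.search(text[:m.start()])]

sentence = "The quick brown fox jumps over the lazy dog"
print(words_preceded_by_four(sentence, UNANCHORED))  # ['jumps', 'over', 'the', 'lazy', 'dog']
print(words_preceded_by_four(sentence, ANCHORED))    # ['jumps']
```

Without the anchor, every word from the fifth onward qualifies; with it, only the fifth word does.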
If you want to know what this "?<=" means, check this:
https://stackoverflow.com/a/2973495/11280142
Related
I am trying to write a regex that finds the first word in each line that contains the character a.
For a string like:
The cat ate the dog
and the mouse
The expression should find cat and
So far, I have:
/\b\w*a\w*\b/g
However this will return every match in each line, not just the first match (cat ate and).
What is the easiest way to only return the first occurrence?
Assuming you are only looking for words without digits and underscores (\w would include those), I'd suggest:
(?i)^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)
And use whatever is in the 1st capture group. See an online demo. Or, if supported:
(?i)^.*?\K(?<!\S)[b-z]*a[a-z]*(?!\S)
See an online demo.
Please note that I used lookarounds to assert that the word is delimited only by whitespace (or by the start or end of the line). You may also use word boundaries if you prefer and swap those lookarounds for \b. Also, depending on your application, you can probably replace the inline case-insensitive switch with a flag. For example, if you happen to use JavaScript, /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi should probably be your option. See for example:
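var myString = "The cat ate the dog\nand the mouse";
// Use a regex literal: in a plain string passed to new RegExp, "\S" collapses to "S".
var myRegexp = /^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)/gmi;
var m = myRegexp.exec(myString);
while (m != null) {
  console.log(m[1]); // group 1: first word containing "a" on the current line
  m = myRegexp.exec(myString);
}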
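For comparison, here is a small Python sketch of the same capture-group pattern, assuming the case-insensitive and multiline switches are passed as flags rather than inline:

```python
import re

text = "The cat ate the dog\nand the mouse"
# re.M makes ^ match at every line start; re.I replaces the inline (?i) switch.
pattern = re.compile(r"^.*?(?<!\S)([b-z]*a[a-z]*)(?!\S)", re.M | re.I)
# With a single capture group, findall returns the group contents.
first_words = pattern.findall(text)
print(first_words)  # ['cat', 'and']
```

The \K variant is not available in Python's re module, so the capture group is the form to use there.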
If you want to match a word using \w, you might also use a negated character class that matches any character except a or a newline.
Then match a word that contains at least one a, using a word boundary \b
^[^a\n\r]*\b([^\Wa]*a\w*)
The pattern matches:
^ Start of string
[^a\n\r]*\b Optionally match any character except a or a newline
( Capture group 1
[^\Wa]*a\w* Optionally match a word character without a, then match a and optional word characters
) Close group 1
Regex demo
Using whitespace boundaries on the left and right:
^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)
Regex demo
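A quick Python sketch comparing the two patterns. The sample text is an assumption: it extends the question's example with a line that has no "a" and a line with a hyphenated word.

```python
import re

text = "The cat ate the dog\nno such word here\nand the mouse\nbig x-ray lab"

word_boundary = re.compile(r"^[^a\n\r]*\b([^\Wa]*a\w*)", re.M)
whitespace = re.compile(r"^[^a\n\r]*(?<!\S)([^\Wa]*a\w*)(?!\S)", re.M)

wb_hits = word_boundary.findall(text)
ws_hits = whitespace.findall(text)
print(wb_hits)  # ['cat', 'and', 'ray']
print(ws_hits)  # ['cat', 'and']
```

The line without an "a" yields nothing in either case. On the last line the word-boundary version captures ray out of x-ray, while the whitespace version finds nothing at all, because the leading [^a\n\r]* cannot move past the a in x-ray to reach lab.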
The text could be matched with the regular expression
(?=(\b[a-z]*a[a-z]*\b)).*\r?\n
with the multiline and case-insensitive flags set. For each match capture group 1 contains the first word (comprised only of letters) in a line that contains an "a". There are no matches in lines that do not contain an "a".
Demo
The expression can be broken down as follows.
(?= # begin a positive lookahead
\b # match a word boundary
([a-z]*a[a-z]*) # match a word containing an "a" and save to
# capture group 1
)
.*\r?\n # match the remainder of the line including the
# line terminator
Hard to word this correctly, but TL;DR.
I want to match, in a given text sentence (let's say "THE TREE IS GREEN") if any space is doubled (or more).
Example:
"In this text,
THE TREE IS GREEN should not match,
THE TREE  IS GREEN should
and so should THE  TREE   IS GREEN
but double-spaced  TEXT SHOULD  NOT BE  FLAGGED outside the pattern."
My initial approach would be
/THE( {2,})TREE( {2,})IS( {2,})GREEN/
but this only matches if all spaces are double in the sequence, therefore I'd like to make any of the groups trigger a full match. Am I going the wrong way, or is there a way to make this work?
You can use a negative lookahead, if your regex flavor supports it.
First exclude the sentence that should fail, in your case "THE TREE IS GREEN" with single spaces, then follow it with the general pattern that catches the cases you want.
(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)
https://regex101.com/r/EYDU6g/2
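A minimal Python sketch of this negative-lookahead approach. The sample lines are an assumption modelled on the question, and underscores stand in for spaces so that the doubled spaces stay visible.

```python
import re

# Underscores are converted to real spaces before matching.
lines = [
    "THE_TREE_IS_GREEN should not match",
    "THE_TREE__IS_GREEN should",
    "and so should THE__TREE___IS_GREEN",
    "but double-spaced__TEXT SHOULD__NOT BE FLAGGED",
]
text = "\n".join(lines).replace("_", " ")

# Reject the single-spaced sentence first, then accept any spacing variant.
pattern = re.compile(r"(?!THE TREE IS GREEN)THE +TREE +IS +GREEN")
hits = [m.group().replace(" ", "_") for m in pattern.finditer(text)]
print(hits)  # ['THE_TREE__IS_GREEN', 'THE__TREE___IS_GREEN']
```

Doubled spaces outside the sentence are ignored because the pattern only ever matches the four words themselves.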
You can just search for the spaces that you're looking for:
/ {2,}/ will work to match two or more of the space character. (https://regexr.com/4h4d4)
You can capture the results by surrounding it with parentheses - /( {2,})/
You may want to broaden it a bit.
/\s{2,}/ will match any doubling of whitespace.
(\s - means any whitespace - space, tab, newline, etc.)
No need to match the whole string, just the piece that's of interest.
If I am not mistaken, you want the whole line to match whenever it contains 2 or more spaces between 2 uppercase parts.
If that is the case, you might use:
^.*[A-Z]+ {2,}[A-Z]+.*$
^ Start of string
.*[A-Z]+ match any char except a newline 0+ time, then match 1+ times [A-Z]
[ ]{2,} Match 2 or more times a space (used square brackets for clarity)
[A-Z]+ Match 1+ times an uppercase char
.*$ Match any char except a newline 0+ times until the end of the string
Regex demo
You could do this:
import re

# One or more spaces between the words, so every spacing variant is found first.
pattern = r"THE +TREE +IS +GREEN"
test_str = ("In this text,\n"
            "THE TREE IS GREEN should not match,\n"
            "THE TREE  IS GREEN should\n"
            "and so should THE  TREE   IS GREEN\n"
            "but double-spaced  TEXT SHOULD  NOT BE  FLAGGED outside the pattern.")
matches = re.finditer(pattern, test_str, re.MULTILINE)
for match in matches:
    # Keep only the variants that differ from the plain single-spaced sentence.
    if match.group() != 'THE TREE IS GREEN':
        print(match.group())
I'm trying to search for characters within a string, but only within a given range of positions in that string.
For example, let's say I have to look for the character 'o' in:
the quick fox jumped over the lazy dog
But I only need to search for this character within the range from character 20 (the letter 'd') to character 25 (the letter 'r').
How would I write a regular expression to match just this character between those two positions?
I have tried ^(.{20})o(.{13})$ to no avail. All I can find are resources about character ranges ([A-Z], for example) rather than positional ranges.
You can use this regex :
/^.{0,20}.*(o).*r/
In this regex, the anchor ^ makes sure the match begins at the first character of the string. Next, .{0,20} moves over up to 20 characters, i.e. to the end of the letter d of jumped. Then .* is used because we don't know how many characters come before the o, and another .* runs until we reach r.
demo https://regex101.com/r/PLHS43/1
There is another way using this regex:
/^.{0,20}.*(o).*?r{1}/
It basically does the same, but it stops at the first r it finds after the o and matches the o that lies between characters 20 and 25.
demo: https://regex101.com/r/3cX2gw/1
Does it have to be a single regex? Unix prides itself on using pipes to connect simple commands instead of writing complicated and therefore error-prone expressions.
In the shell:
echo 'the quick fox jumped over the lazy dog' | cut -c 20-25
or in JavaScript:
'the quick fox jumped over the lazy dog'.substr(19,6)
Both give the slice "d over"; a simple expression can then find the letter "o" in that slice as the next step.
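The same slice-then-search idea, sketched in Python. The second half relies on the pos and endpos arguments that compiled patterns accept, which is a different route to the same range restriction, not something the answer itself proposes.

```python
import re

s = "the quick fox jumped over the lazy dog"

# Characters 20..25 (1-based) are the slice s[19:25] in 0-based indexing.
window = s[19:25]
print(repr(window))  # 'd over'

# Search only inside the window, then translate back to a 1-based position.
m = re.search("o", window)
position = 19 + m.start() + 1 if m else None
print(position)  # 22

# Compiled patterns also take pos/endpos, which restricts the search
# to the same range without slicing.
m2 = re.compile("o").search(s, 19, 25)
print(m2.start() + 1 if m2 else None)  # 22
```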
Designing an expression for the given problem is quite a puzzle, maybe we could just start with:
^.{0,21}\K((?:[^o]*)(o*)|(o*)(?:[^o]*)).{4}.*\K$
yet we'd run into problems, including the {4} quantifier failing whenever an o is found.
My guess is that some sort of recursion might be required, difficult to integrate though.
Demo
If you want to capture a single o, you might use a capturing group:
^.{20}[^o]*(o)
^ Start of string
.{20} Match any character 20 times
[^o]* Match 0+ times not o
(o) Capture in group 1 matching o
Regex demo
If you want to match every o in that range and a lookbehind containing a quantifier (finite or infinite) is supported, you might use:
(?<=^.{20,24})o
(?<= Positive lookbehind, assert what is on the left is
^ Assert start of string
.{20,24} Match 20 - 24 times any character except a newline
) Close positive lookbehind
o Match o literally
For example a regex demo in C#
This finds the letter "o" between the 20th and the 25th character in a string:
^.{20}[^o]{0,4}\Ko
Explanation:
^ # beginning of line
.{20} # 20 any characters
[^o]{0,4} # 0 up to 4 any character that is not o
\K # forget all we have seen until this position
o # the letter o
Demo
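Python's re module supports neither \K nor variable-width lookbehind, so there a capture group has to stand in for both. A minimal sketch under that assumption (the second test string is made up for the illustration):

```python
import re

# Same idea as ^.{20}[^o]{0,4}\Ko, with a capture group in place of \K.
pattern = re.compile(r"^.{20}[^o]{0,4}(o)")

s = "the quick fox jumped over the lazy dog"
m = pattern.search(s)
print(m.start(1) + 1)  # 22 (1-based position of the captured o)

# Here the first o after character 20 sits at position 27, outside the range.
too_late = "the quick fox jumped xxxxxo"
print(pattern.search(too_late))  # None
```

The position of the o comes from m.start(1) rather than from the full match, which still begins at the start of the string.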
I want to match specific strings, but only from the beginning to the 5th word of an article title.
Input string:
The 14 best US colleges in the West are dominated by California — here's who makes the cut.
regex:
/^.*(\bbest\b|\btop\b|\bhot\b).*$/
Currently it matches the whole article title, but I want to search only up to "colleges".
It should also ignore (not match) strings like laptop, hot-spot, etc.
You can use this expression
^((?:\w+\s?){1,5}).*
Explanation:
^ assert position at start of the string
\w+ match one or more word characters
\s? match an optional whitespace character
{1,5} Quantifier - Between 1 and 5 times, as many times as possible
.* matches any character (except newline)
This matches the first 5 words (and spaces).
^(\w+\s){0,4}\b(best|top|hot)(\s|$)
You want to match a string within the first five words of the input sentence. Counted from the start of the sentence, there must therefore be 0-4 words before the word you want to match. So you need ^(\w+\s){0,4} before the specific words you want to match. See https://regex101.com/r/nS0dU6/4
regex101 comes to help again.
^(?=(?:\w+\s){0,4}?(?:best|top|hot)\b(?!-))(\w+(?:\s\w+){0,4})
(?=(?:\w+\s){0,4}?(?:best|top|hot)\b(?!-)) checks that the keyword is within the first 5 words (note that (?!-) is added to cater for words such as hot-spot)
(\w+(?:\s\w+){0,4}) then matches up to the first 5 words
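A short Python sketch of this pattern. The helper name first_five_if_keyword and the second title are hypothetical, and the first title is shortened from the question.

```python
import re

pattern = re.compile(
    r"^(?=(?:\w+\s){0,4}?(?:best|top|hot)\b(?!-))(\w+(?:\s\w+){0,4})"
)

def first_five_if_keyword(title):
    """Return up to the first five words if a keyword is among them, else None."""
    m = pattern.match(title)
    return m.group(1) if m else None

title = "The 14 best US colleges in the West are dominated by California"
print(first_five_if_keyword(title))  # The 14 best US colleges

# Counter-example: "laptop" and "hot-spot" must not count, and the
# standalone "top" is only the seventh word.
other = "My laptop is a hot-spot and top device"
print(first_five_if_keyword(other))  # None
```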
I want to match entire words (or strings really) that contain only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, and either at the end of the string or followed by whitespace. If you can't use lookbehind (for example, in a JavaScript engine that predates lookbehind support) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
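The three variants can be compared side by side in Python. This is just a quick sketch on the question's sample sentence, where findall returns group 1 for the patterns that have a group:

```python
import re

text = "dog god ogd, dogs o"

lookarounds = re.findall(r"(?<!\S)[dog]+(?!\S)", text)
leading_spelled_out = re.findall(r"(?:^|\s)([dog]+)(?!\S)", text)
both_spelled_out = re.findall(r"(?:^|\s)([dog]+)(?:$|\s)", text)

print(lookarounds)          # ['dog', 'god', 'o']
print(leading_spelled_out)  # ['dog', 'god', 'o']
print(both_spelled_out)     # ['dog', 'o']
```

In the last variant the match for "dog" consumes the space after it, so "god" no longer has a leading whitespace character to match against and is skipped.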
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Example in javascript
Example in php
Anything between two square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges such as [0-9] or [a-z], but the class still matches only 1 character. The + and * are quantifiers: + matches 1 or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}), putting a digit or multiple digits in-between: {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses () is captured as a group, which can be used in callbacks or replacements, say when you want to return the matched values or reuse them in the string. The ?! is a negative lookahead: here it asserts that the match is not followed by a word character or a comma, so that words containing other characters (dogs) or followed by a comma (ogd,) are not matched.
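A few quick checks of the quantifiers and the lookahead in Python; the digit and letter test strings are made up for the illustration:

```python
import re

# {2} means exactly two repetitions; {1,3} means between one and three.
two_ok = bool(re.fullmatch(r"[0-9]{2}", "42"))
two_bad = bool(re.fullmatch(r"[0-9]{2}", "423"))
one_to_three = [bool(re.fullmatch(r"[a-z]{1,3}", w)) for w in ["a", "ab", "abc", "abcd"]]
print(two_ok, two_bad)  # True False
print(one_to_three)     # [True, True, True, False]

# The negative lookahead rejects a run of d/o/g that is followed by a word
# character or a comma.
dog_hits = re.findall(r"[dog]+(?![\w,])", "dog god ogd, dogs o")
print(dog_hits)  # ['dog', 'god', 'o']
```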