regex - Checking for characters within 'line range' - regex

I'm trying to search for characters within a string, but only within a given range of positions in that string.
For example, let's say I have to look for the character 'o' in:
the quick fox jumped over the lazy dog
But I only need to search for this character within the range from character 20 (the letter 'd') to character 25 (the letter 'r').
How would I write a regex to match just this character between both positions?
I have tried ^(.{20})o(.{13})$ to no avail. All I can find are resources about character ranges ([A-Z], for example) rather than positional ranges.

You can use this regex :
/^.{0,20}.*(o).*r/
In this regex, the anchor ^ makes sure the match begins at the first character of the string. Next, .{0,20} moves past up to 20 characters, which takes us to the end of the letter d of jumped. Then we use .* because we don't know how many characters precede the o, and another .* until we reach r.
demo https://regex101.com/r/PLHS43/1
There is another way using this regex:
/^.{0,20}.*(o).*?r{1}/
It does basically the same thing, but it stops at the first r it finds and matches the o that lies between characters 20 and 25.
demo: https://regex101.com/r/3cX2gw/1

Do you really need to do this in a single regex? Unix prides itself on using pipes to connect simple commands instead of writing complicated, and therefore error-prone, expressions.
In the shell:
echo 'the quick fox jumped over the lazy dog' | cut -c 20-25
or in JavaScript:
'the quick fox jumped over the lazy dog'.substr(19,6)
Both give the slice "d over"; a simple expression can then find the letter "o" in that slice in the next step.
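The same slice-then-search idea carries over to other languages. Here is a minimal Python sketch, assuming 1-based inclusive positions as in the question; the helper name find_in_range is made up for illustration.

```python
def find_in_range(text, char, first, last):
    """Return the 1-based positions of `char` between positions
    `first` and `last` (both inclusive, 1-based) of `text`."""
    window = text[first - 1:last]  # 1-based inclusive -> Python slice
    return [first + i for i, c in enumerate(window) if c == char]

sentence = "the quick fox jumped over the lazy dog"
print(sentence[19:25])                       # d over
print(find_in_range(sentence, "o", 20, 25))  # [22]
```

Adding the offset back onto the index found in the slice gives the position in the original string, which a search on the slice alone would lose.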

Designing an expression for the given problem is quite a puzzle, maybe we could just start with:
^.{0,21}\K((?:[^o]*)(o*)|(o*)(?:[^o]*)).{4}.*\K$
yet we'd encounter challenges, including the {4} quantifier failing whenever an o is found.
My guess is that some sort of recursion might be required, though it would be difficult to integrate.
Demo

If you want to capture a single o, you might use a capturing group:
^.{20}[^o]*(o)
^ Start of string
.{20} Match any character 20 times
[^o]* Match 0+ times not o
(o) Capture in group 1 matching o
Regex demo
If you want to match every o in the range, and a finite or infinite (variable-length) lookbehind is supported, you might use:
(?<=^.{20,24})o
(?<= Positive lookbehind, assert what is on the left is
^ Assert start of string
.{20,24} Match 20 - 24 times any character except a newline
) Close positive lookbehind
o Match o
For example a regex demo in C#
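Python's built-in re module only accepts fixed-width lookbehind, so (?<=^.{20,24})o cannot be used there as written. As a swapped-in alternative (not the lookbehind technique itself), the finditer method of a compiled pattern accepts pos and endpos arguments that bound the search window. A minimal sketch:

```python
import re

sentence = "the quick fox jumped over the lazy dog"
o_pattern = re.compile("o")

# pos/endpos are 0-based and endpos is exclusive, so 20..25 covers
# the 21st through 25th characters, like the lookbehind above.
positions = [m.start() + 1 for m in o_pattern.finditer(sentence, 20, 25)]
print(positions)  # [22]
```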

This finds the letter "o" between the 20th and the 25th character in a string:
^.{20}[^o]{0,4}\Ko
Explanation:
^ # beginning of line
.{20} # any 20 characters
[^o]{0,4} # 0 to 4 characters that are not o
\K # forget all we have seen until this position
o # the letter o
Demo

Related

PCRE Regex: Is it possible to check within only the first X characters of a string for a match

PCRE Regex: Is it possible for Regex to check for a pattern match within only the first X characters of a string, ignoring other parts of the string beyond that point?
My Regex:
I have a Regex:
/\S+V\s*/
This checks the string for non-whitespace characters which have a trailing 'V' followed by whitespace or the end of the string.
This works. For example:
Example A:
SEBSTI FMDE OPORV AWEN STEM students into STEM
// Match found in 'OPORV' (correct)
Example B:
ARKFE SSETE BLMI EDSF BRNT CARFR (name removed) Academy Networking Event
//Match not found (correct).
Re the capitalised text: each letter, and the letter's placement, has a meaning in the source data. This is followed by generic info for humans to read ("Academy Networking Event", etc.)
My Issue:
It can theoretically occur that sometimes there are names that involve roman numerals such as:
Example C:
ARKFE SSETE BLME CARFR Academy IV Networking Event
//Match found (incorrect).
I would like my Regex above to only check the first X characters of the string.
Can this be done in PCRE Regex itself? I can't find any reference to length counting in Regex and I suspect this can't easily be achieved. String lengths are completely arbitrary. (We have no control over the source data).
Intention:
/\S+V\s*/{check within first 25 characters only}
ARKFE SSETE BLME CARFR Academy IV Networking Event
^
\- Cut off point. Not found so far so stop.
//Match not found (correct).
Workaround:
The Regex is in PHP and my current solution is to cut the string in PHP, to only check the first X characters, typically the first 20 characters, but I was curious if there was a way of doing this within the Regex without needing to manipulate the string directly in PHP?
$valueSubstring = substr($coreRow['value'],0,20); /* first 20 characters only */
$virtualCount = preg_match_all('/\S+V\s*/',$valueSubstring);
The trick is to capture the end of the line after the first 25 characters in a lookahead and to check if it follows the eventual match of your subpattern:
$pattern = '~^(?=.{0,25}(.*)).*?\K\S+V\b(?=.*\1)~m';
demo
details:
^ # start of the line
(?= # open a lookahead assertion
.{0,25} # up to the first 25 characters
(.*) # capture the end of the line
) # close the lookahead
.*? # consume lazily the characters
\K # the match result starts here
\S+V # your pattern
\b # a word boundary (that matches between a letter and a white-space
# or the end of the string)
(?=.*\1) # check that the end of the line follows with a reference to
# the capture group 1 content.
Note that you can also write the pattern in a more readable way like this:
$pattern = '~^
(*positive_lookahead: .{0,25} (?<line_end> .* ) )
.*? \K \S+ V \b
(*positive_lookahead: .*? \g{line_end} ) ~xm';
(The alternative syntax (*positive_lookahead: ...) is available since PHP 7.3)
You can find your pattern after X chars and skip the whole string, else, match your pattern. So, if X=25:
^.{25,}\S+V.*(*SKIP)(*F)|\S+V\s*
See the regex demo. Details:
^.{25,}\S+V.*(*SKIP)(*F) - start of string, 25 or more chars other than line break chars, as many as possible, then one or more non-whitespaces and V, and then the rest of the string, the match is failed and skipped
| - or
\S+V\s* - match one or more non-whitespaces, V and zero or more whitespace chars.
Any V ending in the first 25 positions
^.{1,24}V\s
See regex
Any word ending in V in the first 25 positions
^.{1,23}[A-Z]V\s
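For comparison with the PHP substr workaround in the question, here is a minimal Python sketch of the same idea. It assumes a cut-off of 25 characters and uses the endpos argument of a compiled pattern's search method, which makes the engine treat the string as if it ended there. The function name has_v_code is invented for this example.

```python
import re

V_CODE = re.compile(r"\S+V\s*")

def has_v_code(value, limit=25):
    """True if the pattern matches entirely within the first `limit` characters."""
    return V_CODE.search(value, 0, limit) is not None

print(has_v_code("SEBSTI FMDE OPORV AWEN STEM students into STEM"))      # True
print(has_v_code("ARKFE SSETE BLME CARFR Academy IV Networking Event"))  # False
```

This leaves the original string untouched, but it is still a host-language feature rather than something expressed inside the pattern itself.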

Match the nth word in a line

In the app I use, I cannot select match group 1.
The only result that I can use is the full match from a regex.
But I need the 5th word, "jumps", as the match result, and not the complete match "The quick brown fox jumps".
^(?:[^ ]*\ ){4}([^ ]*)
The quick brown fox jumps over the lazy dog
Here is a link https://regex101.com/r/nB9yD9/6
Since you need the entire match to be only the n-th word, you can try to use 'positive lookbehind', which allows you to only match something, if it is preceded by something else.
To match only the fifth word, you want to match the first word that has four words before it.
To match four words (i.e. word characters followed by a space character):
(\w+\s){4}
To match a single word, but only if it was preceded by four other words:
(?<=(\w+\s){4})(\w+)
Test the result here https://regex101.com/r/QIPEkm/1
To find the 3rd word of sentence, use:
^(?:\w+ ){2}\K\w+
Explanation:
^ # beginning of line
(?: # start non capture group
\w+ # 1 or more word character
# a space
){2} # group must appear twice (change {2} to {3} to get the 4th word and so on)
\K # forget all we have seen until this position
\w+ # 1 or more word character
Demo
It works with PCRE: https://regex101.com/r/pR22LK/2. Your app doesn't seem to support it, but I don't know how it works. I think you have to extract all the words into an array and then select the ones you want. – Toto
Hello Toto, your solution works in the App too, like PCRE, thanks! – gsxr1300
To match "the first" four words (i.e. word characters followed by a space character):
^(\w+\s){4}
To match a single word, but only if it was preceded by "the first" four other words:
(?<=^(\w+\s){4})(\w+)
Note the ^ difference
If you want to know what "?<=" means, check this:
https://stackoverflow.com/a/2973495/11280142
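When the host language does expose capture groups (unlike the app in the question), the group-based pattern from the question is enough. Here is a minimal Python sketch. Python's re module has neither \K nor variable-length lookbehind, so it falls back to group 1; the helper name nth_word is hypothetical.

```python
import re

def nth_word(line, n):
    """Return the n-th space-separated word of `line` (1-based), or None."""
    # Skip n-1 words, each followed by a space, then capture the next one.
    m = re.match(r"(?:[^ ]* ){%d}([^ ]*)" % (n - 1), line)
    return m.group(1) if m else None

print(nth_word("The quick brown fox jumps over the lazy dog", 5))  # jumps
```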

Simple trouble with regular expression

I have this string:
I have an eraser and 2 pencils.
Jane has a ruler and a stapler.
I need to get all the items that I have (lines starting with I have). I have tried these expressions:
(?:I have|and)\h+((?:a|an|\d+)\h+(?:\w+))
# returns some of the items that Jane has.
(I have )(?(1)((?:a|an|\d+) \w+))
# returns only the word closest to the beginning of the string.
I'm looking for a way to match a given string/expression at the beginning of the line or somewhere before the capturing group. Thanks in advance.
Note: I'm working with PCRE
It's still tricky to have a variable number of groups, but you can try this:
I have (?:an |a )?(\d? ?\w+)(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?(?: and (?:an |a )?(\d? ?\w+))?
Below are some sample results:
"I have an eraser and a pencil and an item" -> ["eraser", "pencil", "item"]
"She has a turtle and a car" -> []
"I have 3 bricks and 4 knees and a tie" -> ["3 bricks", "4 knees", "tie"]
"I have a motorcycle and a bag" -> ["motorcycle", "bag"]
"I have a journal" -> ["journal"]
"I have wires and tires" -> ["wires", "tires"]
"I must say I have a train and a bicycle" -> ["train", "bicycle"]
For each line, it captures at most 4 items: the first one plus the three optional "and" groups.
This is a typical case for anchoring at the end of previous match with \G.
We're trying to match some text followed by an unknown number of tokens, and we need to capture each token individually. The regex engine is perfectly capable of repeating a construct to match a repeating token, but each backreference must be defined on its own. Therefore, repeating a capturing group ends up overwriting its stored value and returning only the last matched value. This task may be achieved by 2 different strategies: either capturing all tokens with 1 pattern and then using a second pattern match to split them, or performing one full match for each token.
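The first strategy (one pattern to grab the whole list, a second one to split it) can be sketched in Python, whose re module has no \G. This is only an illustration under simple assumptions: items are separated by commas or " and ", articles are a/an/some, and the function name items_i_have is made up.

```python
import re

ITEM_SEPARATOR = re.compile(r"\s*,\s*|\s+and\s+")
ARTICLE = re.compile(r"^(?:an?|some)\s+")

def items_i_have(text):
    """Collect the items listed on lines that start with 'I have'."""
    items = []
    for line in text.splitlines():
        # Step 1: capture everything after "I have", minus a final period.
        m = re.match(r"I have\s+(.*?)\.?$", line)
        if not m:
            continue
        # Step 2: split the captured list and drop leading articles.
        for chunk in ITEM_SEPARATOR.split(m.group(1)):
            chunk = ARTICLE.sub("", chunk.strip())
            if chunk:
                items.append(chunk)
    return items

text = "I have an eraser and 2 pencils.\nJane has a ruler and a stapler."
print(items_i_have(text))  # ['eraser', '2 pencils']
```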
Instead of trying to get all the items "I have" in the same match, we're going to attempt to match once per item. This approach was also tried with some of the patterns proposed in the comments. However, as you may have realized, the regex engine also matches from the middle of the string, and thus matches unwanted cases like:
She has >>a turtle<< ...
This is where we can use an anchor like \G. Our strategy will be:
1. Match ^I have and capture 1 item (the match ends here).
2. In the next match, start at the end of the previous match, and match 1 item.
3. Repeat (2) for successive matches.
Now, this can be translated to regex:
^I have an? + the token
Literal text at the beginning of the line.
an or a.
And we'll cover the token construct later.
\G(?!^)(?: and)? an? + the token
\G matches a zero-width position at the end of the previous match. This is how we keep the regex engine from attempting a match anywhere else in the string.
However, \G also matches at the beginning of the string, and we don't want to match the string "an item...". There's a trick: we use the negative lookahead (?!^) to assert that the current position is not the start of the text. Therefore, it's guaranteed to match only where the previous match in (1) left off.
(?: and)? is optional, so it may or may not be there.
an? matches the article (an/a).
Do you see that both end up with the same construct? if we join the 2 options together:
(?:^I have:?|(?!^)\G(?: and)?) an? <<the token>>
Let's talk about the token. If it were only one word, we'd use \w+. That's not the case. Neither is .* because it shouldn't match the whole string. And we can't consume part of the following token because otherwise it wouldn't be returned in the next match.
I have a new eraser and a pencil
^
|
How does it stop here?!
How do we define a condition not to allow a match beyond that position?
It's not followed by a/an/and !!!
This can be achieved by a negative lookahead, to guarantee it's not followed by a/an/and before we match a character: (?! a | an | and ).. As you can imagine, that construct will be repeated to match every one of the characters in a token.
This pattern matches what we want: (?:(?! and | an? ).)+
And one more thing, we'll use a capturing group around it to be able to extract the text.
the token = ((?:(?! and | an? ).)+)
First version:
We now have the first working version of the regex. Put together:
(?:^I have:?|(?!^)\G(?: and)?) an? ((?:(?! and | an? ).)+)
Test it in regex101
A few more tricks:
Following the same principle, this approach allows us to include more conditions to the match. For instance,
Not anchored to the start of line.
Without capturing groups, returning each token as the value of the full match.
Items can be separated with commas.
"I have" could be followed by any word, not necessarily an article, using exceptions.
etc.
What to choose depends on the subject text, and it should be tested with several examples and corrected until it works as desired.
Solution:
This is the pattern I'd personally use in this case:
(?: # SUBPATTERN 1
\bI have:? # "I have"
(?![ ](?:to|been|\w+?[en]d)\b) # not followed by to|been|\w+[en]d
| # or
(?!\A)\G[ ] # anchored to previous match
?,?(?:[ ]?and)? # optional comma or "and"
) #
#
[ ](?:(?:an?|some)[ ])? # ARTICLE: a|an|some
#
\K # \K (reset match)
#
(?: # SUBPATTERN 2
(?! # Negative lookahead (exceptions)
[ ]*, # a. Comma to list another item
| # b. Article (a|an), some
[ ](?:a(?:nd?)?|some)\b # or and
) #
. # MATCH each character in a token
)+ # REPEAT Subpattern 2
One-liner:
(?:\bI have:?(?! (?:to|been|\w+?[en]d)\b)|(?!\A)\G ?,?(?: ?and)?) (?:(?:an?|some) )?\K(?:(?! *,| (?:a(?:nd?)?|some)\b).)+
Test in regex101
However, it should be tested to identify exceptions and use cases. This is how it behaves with the examples discussed in this post.
Matching the subject text:
Each match has been marked with >> <<.
I have an >>eraser<<, a >>pencil<< and an >>item<<
She has a turtle and a car
I have an >>awesome motorcycle tatoo<< and a >>bag<<
I have to say I have a >>train<< and a >>bicycle<<
I have >>3 bricks<< and >>4 knees<< and a >>tie<<
Notice these are full matches, and not the value returned by a group. Simply add a group to enclose the "subpattern 2" to capture the tokens.
Test in regex101

Regular expression for crossword solution

This is a crossword problem. Example:
the solution is a 6-letter word which starts with "r" and ends with "r"
thus the pattern is "r....r"
the unknown 4 letters must be drawn from the pool of letters "a", "e", "i" and "p"
each letter must be used exactly once
we have a large list of candidate 6-letter words
Solutions: "rapier" or "repair".
Filtering for the pattern "r....r" is trivial, but finding words which also have [aeip] in the "unknown" slots is beyond me.
Is this problem amenable to a regex, or must it be done by exhaustive methods?
Try this:
r(?:(?!\1)a()|(?!\2)e()|(?!\3)i()|(?!\4)p()){4}r
...or more readably:
r
(?:
(?!\1) a () |
(?!\2) e () |
(?!\3) i () |
(?!\4) p ()
){4}
r
The empty groups serve as check marks, ticking off each letter as it's consumed. For example, if the word to be matched is repair, the e will be the first letter matched by this construct. If the regex tries to match another e later on, that alternative won't match it. The negative lookahead (?!\2) will fail because group #2 has participated in the match, and never mind that it didn't consume anything.
What's really cool is that it works just as well on strings that contain duplicate letters. Take your redeem example:
r
(?:
(?!\1) e () |
(?!\2) e () |
(?!\3) e () |
(?!\4) d ()
){4}
m
After the first e is consumed, the first alternative is effectively disabled, so the second alternative takes it instead. And so on...
Unfortunately, this technique doesn't work in all regex flavors. For one thing, they don't all treat empty/failed group captures the same. The ECMAScript spec explicitly states that references to non-participating groups should always succeed.
The regex flavor also has to support forward references--that is, backreferences that appear before the groups they refer to in the regex. (ref) It should work in .NET, Java, Perl, PCRE and Ruby, that I know of.
Assuming that you meant that the unknown letters must be among [aeip], then a suitable regex could be:
/r[aeip]{4,4}r/
Which front-end language is being used to compare the strings: Java, .NET, ...?
Here is an example (pseudocode) using Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String mandateLetters = "aeip";
// \b on both sides keeps the match to a whole word; use * instead of {4} for any length
String regPattern = "\\br[" + mandateLetters + "]{4}r\\b";
Pattern pattern = Pattern.compile(regPattern);
Matcher matcher = pattern.matcher("is this repair ");
boolean found = matcher.find(); // true: "repair" matches
Why not replace each '.' in your original pattern with '[aeip]'?
You'd wind up with a regex string r[aeip][aeip][aeip][aeip]r.
This could of course be shortened to r[aeip]{4,4}r, but that would be a pain to implement in the general case and probably wouldn't improve the code any.
This doesn't address the issue of duplicate letter use. If I were coding it, I'd handle that in code outside the regexp - mostly because the regexp would get more complicated than I'd care to handle.
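Handling the "exactly once" rule outside the regex, as suggested above, could look like the following Python sketch. The regex only checks the shape of the word; a Counter comparison checks that the unknown slots use exactly the pool letters, duplicates included. The function name crossword_matches and its parameters are assumptions made for this example.

```python
import re
from collections import Counter

def crossword_matches(candidates, shape, pool):
    """Return candidates that fit `shape` ('.' marks an unknown slot)
    and whose unknown slots use exactly the letters in `pool`."""
    # Turn each '.' into a capturing group so the slot letters can be read back.
    shape_re = re.compile(
        "".join("(.)" if ch == "." else re.escape(ch) for ch in shape)
    )
    wanted = Counter(pool)
    hits = []
    for word in candidates:
        m = shape_re.fullmatch(word)
        if m and Counter(m.groups()) == wanted:
            hits.append(word)
    return hits

words = ["rapier", "repair", "reaper", "rapper", "redeem"]
print(crossword_matches(words, "r....r", "aeip"))  # ['rapier', 'repair']
print(crossword_matches(words, "r....m", "deee"))  # ['redeem']
```

Because the pool is compared as a multiset, the same function covers pools with repeated letters without any change to the pattern.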
So the "only once" part is the critical thing. Listing all permutations is obviously not feasible. If your language/environment supports lookaheads and backreferences you can make it a bit easier for yourself:
r(?=[aeip]{4,4})(.)(?!\1)(.)(?!\1|\2)(.)(?!\1|\2|\3).r
Still quite ugly, but here is how it works:
r # match an r
(?= # positive lookahead (doesn't advance position of "cursor" in input string)
[aeip]{4,4}
) # make sure that the four desired characters are ahead
(.) # match any character and capture it in group 1
(?!\1)# make sure that the next character is NOT the same as the previous one
(.) # match any character and capture it in group 2
(?!\1|\2)
# make sure that the next character is neither the first nor the second
(.) # match any character and capture it in group 3
(?!\1|\2|\3)
# same thing again for all three characters
. # match another arbitrary character
r # match an r
Working demo.
This is neither really elegant nor scalable. So you might just want to use r([aiep]{4,4})r (capturing the four critical letters) and ensure the additional condition without regex.
EDIT: In fact, the above pattern is only really useful and necessary if you just want to ensure that you have 4 non-identical characters. For your specific case, again using lookaheads, there is simpler (despite longer) solution:
r(?=[^a]*a[^a]*r)(?=[^e]*e[^e]*r)(?=[^i]*i[^i]*r)(?=[^p]*p[^p]*r)[aeip]{4,4}r
Explained:
r # match an r
(?= # lookahead: ensure that there is exactly one a until the next r
[^a]* # match an arbitrary amount of non-a characters
a # match one a
[^a]* # match an arbitrary amount of non-a characters
r # match the final r
) # end of lookahead
(?=[^e]*e[^e]*r) # ensure that there is exactly one e until the next r
(?=[^i]*i[^i]*r) # ensure that there is exactly one i until the next r
(?=[^p]*p[^p]*r) # ensure that there is exactly one p until the next r
[aeip]{4,4}r # actually match the rest to include it in the result
Working demo.
For r....m with a pool of deee, this could be adjusted as:
r(?=[^d]*d[^d]*m)(?=[^e]*(?:e[^e]*){3,3}m)[de]{4,4}m
This ensures that there is exactly one d and exactly 3 es.
Working demo.
Not a pure regex, because it relies on sed applying two regex tests in sequence:
sed -n -e '/^r[aiep]\{4,\}r$/{/\([aiep]\).*\1/!p;}' YourFile
Select the lines that have 4 letters from the group aeip surrounded by r, then keep only the lines in which no letter of that group occurs twice.
A more scalable solution (no need to write \1, \2, \3 and so on for each letter or position) is to use negative lookahead to assert that each character is not occurring later:
^r(?:([aeip])(?!.*\1)){4}r$
more readable as:
^r
(?:
([aeip])
(?!.*\1)
){4}
r$
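This pattern needs only an ordinary backreference inside a negative lookahead, so it also runs in Python's re module. A minimal sketch that filters a small, made-up candidate list:

```python
import re

# Each pool letter is captured, then rejected if it occurs again later.
ONCE_EACH = re.compile(r"^r(?:([aeip])(?!.*\1)){4}r$")

candidates = ["rapier", "repair", "reaper", "rapper", "raider"]
solutions = [word for word in candidates if ONCE_EACH.match(word)]
print(solutions)  # ['rapier', 'repair']
```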
Improvements
This was a quick solution which works in the situation you gave us, but here are some additional constraints for a more robust version:
If the "pool of letters" may share some letters with the end of string, include the end of pattern in the lookahead:
^r(?:([aeip])(?!.*\1.*\2)){4}(r$)
(may not work as intended in all regex flavors, in which case copy-paste the end of pattern instead of using \2)
If some letters must be present not only once but a different fixed number of times, add a separate lookahead for all letters sharing this number of times. For instance, "r....r" with one "a" and one "p" but two "e" would be matched by this regex (but "rapper" and "repeer" wouldn't):
^r(?:([ap])(?!.*\1.*\3)|([e])(?!.*\2.*\2.*\3)){4}(r$)
The non-capturing group now has 2 alternatives: ([ap])(?!.*\1.*\3) which matches "a" or "p" not followed anywhere before the end by another one, and ([e])(?!.*\2.*\2.*\3) which matches "e" not followed anywhere before the end by 2 more (so it fails on the first one if there are 3 in total).
BTW this solution includes the above one, but the end of pattern is here shifted to \3 (also, cf. note about flavors).

regex to match entire words containing only certain characters

I want to match entire words (or strings really) that contain only defined characters.
For example if the letters are d, o, g:
dog = match
god = match
ogd = match
dogs = no match (because the string also has an "s" which is not defined)
gods = no match
doog = match
gd = match
In this sentence:
dog god ogd, dogs o
...I would expect to match on dog, god, and o (not ogd, because of the comma or dogs due to the s)
This should work for you
\b[dog]+\b(?![,])
Explanation
r"""
\b # Assert position at a word boundary
[dog] # Match a single character present in the list “dog”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
[,] # Match the character “,”
)
"""
The following regex represents one or more occurrences of the three characters you're looking for:
[dog]+
Explanation:
The square brackets mean: "any of the enclosed characters".
The plus sign means: "one or more occurrences of the previous expression"
This would be the exact same thing:
[ogd]+
Which regex flavor/tool are you using? (e.g. JavaScript, .NET, Notepad++, etc.) If it's one that supports lookahead and lookbehind, you can do this:
(?<!\S)[dog]+(?!\S)
This way, you'll only get matches that are either at the beginning of the string or preceded by whitespace, or at the end of the string or followed by whitespace. If you can't use lookbehind (for example, if you're using JavaScript) you can spell out the leading condition:
(?:^|\s)([dog]+)(?!\S)
In this case you would retrieve the matched word from group #1. But don't take the next step and try to replace the lookahead with (?:$|\s). If you did that, the first hit ("dog") would consume the trailing space, and the regex wouldn't be able to use it to match the next word ("god").
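The lookaround version uses a fixed-width lookbehind, so it also works in Python's re module. A small sketch against the sentence from the question:

```python
import re

# A word made only of d, o, g, delimited by whitespace or the string edges.
WORD = re.compile(r"(?<!\S)[dog]+(?!\S)")

print(WORD.findall("dog god ogd, dogs o"))  # ['dog', 'god', 'o']
```

Since both lookarounds are zero-width, the space between "dog" and "god" is not consumed and stays available as the left boundary of the next word.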
Depending on the language, this should do what you need it to do. It will only match what you said above;
this regex:
[dog]+(?![\w,])
in a string of ..
dog god ogd, dogs o
will only match..
dog, god, and o
Example in javascript
Example in php
Anything between square brackets ([]) is a character class: it matches any one of the characters between the brackets. You can also use ranges, such as [0-9] or [a-z], but the class still matches only 1 character. The + and * are quantifiers: + matches 1 or more repetitions, while * matches zero or more. You can specify an explicit repetition count with curly brackets ({}), putting one or two numbers in between: {2} matches exactly 2 repetitions, while {1,3} matches between 1 and 3.
Anything between parentheses () is captured as a group, which is useful when you want to return the matched values or use them as replacements in the string. The ?! is a negative lookahead: the match succeeds only if the character class after it does not match at that position, which ensures that a word is not matched when it is immediately followed by one of those characters.