Regex Matching with Space - regex

I had a very simple question about regex matching, I want have "string" (ignore case) matched
in this case: "thisisastring", nothing should be returned
in this case: "this is a string" a single match on "string" should be returned
Now I had #"([S|s][T|t][R|r][I|i][N|n][G|g])" as the regex, However it doesn't work correctly in the first case.
How should I write this regex?
Thanks in advance!

[S|s] does not match what you seem to think
Please note that [S|s] does not mean "match a S or a s". It means "match one character that is either a S, a | or a s". That's how things work inside a [character class]. To express an OR, you can use a non-capturing group: (?:S|s). But [Ss] is all you need, and case-insensitivity is even better.
Case-Insensitivity
I'm going to assume we're using case-insensitive mode so we end up with a simpler regex. I assume you're in C# as it looks like you're using a verbatim string: (?i) will work. Another way to set case-insensitivity in C# would be RegexOptions.IgnoreCase
Option 1: boundary (close but no cigar)
(?i)\bstring
This no longer matches string in astring. However, it matches string in ##string, which you do not want.
Option 2: lookbehind
(?i)(?<=[ ])string
The lookbehind ensures that string is preceded by a space character. The brackets are optional, they help see the space.
Option 3: \K (but not in C#)
For engines that support it (Perl, PCRE, Ruby 2+):
(?i)[ ]\Kstring
The \K tells the engine to drop what was matched so far from the final match it returns

Related

How to match with regexp any occurence of a specific char within a string delimited by specific delimiters? [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

I'm trying to create a javascript flavored Regex expression that will returned multiple of a named group [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Match a specific regex using matches()

Trying to match a specific word using matches()
*//id[matches(.,lower-case('*\s?Xander\s?*'))]
Examples:
Set of Xanderous- No match
Xander Tray of 6- Match
Tray of 6 pieces Xander- Match
Set of 6 Xander pieces- Match
Any instance of the exact word 'Xander' match is the objective.
The reason the XPath regex dialect doesn't handle word boundaries is that to do it properly, you need to be language-sensitive - a "word" is a cultural artefact.
You could do tokenize(., '\P{L}+') = 'Xander' which tokenizes treating any sequence of non-letters as a separator and then tests if one of the tokens is 'Xander'.
I have been running some tests and it seems word boundaries are not integrated into the XML/XPATH vocabulary. So the next best thing IMO is to test for a whitespace or start/end string anchors surrounding zero or more characters. Therefore, I ended up with:
*//id[matches(lower-case(.),'.*(^|\s)xander($|\s).*')]
Even better would be to drop lower-case alltogether and use the third matches parameter (flags) setting it to case-insensitive matching:
*//id[matches(.,'.*(^|\s)xander($|\s).*','i')]
Roughly, if you want to get the full line matching if it exactly contains the word Xander, you can use \b which delimits a specific word, plus some greedy operators .*:
^.*\bXander\b.*$
Demo: https://regex101.com/r/PvKptN/1
Or if you don't need the whole line, you can simply check if it contains Xander:
\bXander\b
Demo: https://regex101.com/r/PvKptN/2
I hope it satisfies the regex flavor you're using

Regex substring matching on capture group

I have an advanced regex question (unless I am overthinking this).
With my basic knowledge of Regex, it is trivial to match static capture group further down in the string.
P(.): D:\1
Correctly matches
Pb: Db
Pa: Da
and (correctly) does not match
Pa: D:b
So far so good. However, what I need to capture is a set of [a-z]+ after the P and match the one character. So that these should also match:
Pabc: D:c
Pabc: D:a
Pba: D:b
Pba: D:a
but not
Pabc: D:x
Pba: D:g
I started going down the path of writing separate patterns like so (spaces added around the alternation for clarity):
P(.): D:\1 | P(.)(.): D:(\1|\2) | P(.)(.)(.): D:(\1|\2|\3)
But I cannot make even this clumsy solution work in Javascript Regex.
Is there an elegant, correct way to do this? Can it be done with Javascript's limited engine?
The following regex will do it:
P.*(.).*: D:\1
.*(.).* will match one or more characters, capturing one of them.
If the captured character matches the character after D:, then the regex matches.
If the captured character doesn't match, backtracking will ensure that it tries again with a different captured character, until all combinations have been tried.
See regex101.com for running example.

How to negate the whole regex?

I have a regex, for example (ma|(t){1}). It matches ma and t and doesn't match bla.
I want to negate the regex, thus it must match bla and not ma and t, by adding something to this regex. I know I can write bla, the actual regex is however more complex.
Use negative lookaround: (?!pattern)
Positive lookarounds can be used to assert that a pattern matches. Negative lookarounds is the opposite: it's used to assert that a pattern DOES NOT match. Some flavor supports assertions; some puts limitations on lookbehind, etc.
Links to regular-expressions.info
Lookahead and Lookbehind Zero-Width Assertions
Flavor comparison
See also
How do I convert CamelCase into human-readable names in Java?
Regex for all strings not containing a string?
A regex to match a substring that isn’t followed by a certain other substring.
More examples
These are attempts to come up with regex solutions to toy problems as exercises; they should be educational if you're trying to learn the various ways you can use lookarounds (nesting them, using them to capture, etc):
codingBat plusOut using regex
codingBat repeatEnd using regex
codingbat wordEnds using regex
Assuming you only want to disallow strings that match the regex completely (i.e., mmbla is okay, but mm isn't), this is what you want:
^(?!(?:m{2}|t)$).*$
(?!(?:m{2}|t)$) is a negative lookahead; it says "starting from the current position, the next few characters are not mm or t, followed by the end of the string." The start anchor (^) at the beginning ensures that the lookahead is applied at the beginning of the string. If that succeeds, the .* goes ahead and consumes the string.
FYI, if you're using Java's matches() method, you don't really need the the ^ and the final $, but they don't do any harm. The $ inside the lookahead is required, though.
\b(?=\w)(?!(ma|(t){1}))\b(\w*)
this is for the given regex.
the \b is to find word boundary.
the positive look ahead (?=\w) is here to avoid spaces.
the negative look ahead over the original regex is to prevent matches of it.
and finally the (\w*) is to catch all the words that are left.
the group that will hold the words is group 3.
the simple (?!pattern) will not work as any sub-string will match
the simple ^(?!(?:m{2}|t)$).*$ will not work as it's granularity is full lines
This regexp math your condition:
^.*(?<!ma|t)$
Look at how it works:
https://regex101.com/r/Ryg2FX/1
Apply this if you use laravel.
Laravel has a not_regex where field under validation must not match the given regular expression; uses the PHP preg_match function internally.
'email' => 'not_regex:/^.+$/i'