General approach for (equivalent of) "backreferences within character class"? - regex

In Perl regexes, expressions like \1, \2, etc. are usually interpreted as "backreferences" to previously captured groups, but not so when the \1, \2, etc. appear within a character class. In the latter case, the \ is treated as an escape character (and therefore \1 is just 1, etc.).
Therefore, if (for example) one wanted to match a string (of length greater than 1) whose first character matches its last character, but does not appear anywhere else in the string, the following regex will not do:
/\A # match beginning of string;
(.) # match and capture first character (referred to subsequently by \1);
[^\1]* # (WRONG) match zero or more characters different from character in \1;
\1 # match \1;
\z # match the end of the string;
/sx # s: let . match newline; x: ignore whitespace, allow comments
This does not work, since it matches (for example) the string 'a1a2a':
DB<1> ( 'a1a2a' =~ /\A(.)[^\1]*\1\z/ and print "fail!" ) or print "success!"
fail!
I can usually manage to find some workaround [1], but it's always rather problem-specific, and usually far more complicated-looking than what I would do if I could use backreferences within a character class.
Is there a general (and hopefully straightforward) workaround?
[1] For example, for the problem in the example above, I'd use something like
/\A
(.) # match and capture first character (referred to subsequently
# by \1);
(?!.*\1.+\z) # a negative lookahead assertion for "a suffix containing \1";
.* # substring not containing \1 (as guaranteed by the preceding
# negative lookahead assertion);
\1\z # match last character only if it is equal to the first one
/sx
...where I've replaced the reasonably straightforward (though, alas, incorrect) subexpression [^\1]* in the earlier regex with the somewhat more forbidding negative lookahead assertion (?!.*\1.+\z). This assertion basically says "give up if \1 appears anywhere beyond this point (other than at the last position)." Incidentally, I give this solution just to illustrate the sort of workarounds I referred to in the question. I don't claim that it is a particularly good one.

This can be accomplished with a negative lookahead within a repeated group:
/\A # match beginning of string;
(.) # match and capture first character (referred to subsequently by \1);
((?!\1).)* # match zero or more characters different from character in \1;
\1 # match \1;
\z # match the end of the string;
/sx
This pattern can be used even if the group contains more than one character.
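For readers who want to try this outside Perl, here is a minimal sketch of the same idea using Python's re module. The constant and function names are made up for illustration; \Z is Python's spelling of Perl's \z, and re.DOTALL plays the role of /s. Python's re also treats a numeric escape inside a character class as a character rather than a backreference, so the naive pattern misbehaves there as well.

```python
import re

# [^\1] is NOT a backreference here: inside a character class, Python's re
# reads \1 as a character escape, so this pattern is far too permissive.
NAIVE = re.compile(r"\A(.)[^\1]*\1\Z", re.DOTALL)

# Working version: before consuming each middle character, a negative
# lookahead checks that a copy of group 1 does not start at that position.
TEMPERED = re.compile(r"\A(.)(?:(?!\1).)*\1\Z", re.DOTALL)


def first_char_only_repeats_at_end(s):
    """True if s starts and ends with the same character and that
    character appears nowhere in between (hypothetical helper name)."""
    return TEMPERED.search(s) is not None


print(NAIVE.search("a1a2a") is not None)        # True (the unwanted match)
print(first_char_only_repeats_at_end("a1a2a"))  # False
print(first_char_only_repeats_at_end("a12a"))   # True
```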

Related

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract matches of the clauses match-this that is enclosed with anything other than the tilde (~) in the string.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch the occurrences enclosed as ~match-this~, but when I tried its negation, (?!(~match-this)), it matched nothing but empty strings.
When I tried the pattern [^~]match-this[^~], it caught only one match (the 2nd match from above). When I added an asterisk to either negated tilde class, [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk on both, it caught every match-this, including those enclosed by tildes.
Is it possible to achieve this with only one regex test, or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match-this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
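As a rough illustration of the "discard what is not captured" idea, here is a small Python sketch run against the string from the question. The variable names are only for illustration.

```python
import re

# The first alternative swallows the unwanted ~match-this~; only the second
# alternative captures, so group 1 is None for the discarded matches.
PATTERN = re.compile(r"~match-this~|(\bmatch-this\b)")

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# Keep only the matches where capture group 1 took part.
kept = [m for m in PATTERN.finditer(s) if m.group(1) is not None]
starts = [m.start(1) for m in kept]
print(starts)  # [0, 23, 35, 46, 68]
```

The two occurrences at offsets 11 and 57 are consumed by the first alternative and therefore never reach the capturing one.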
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts here or there must be a character other than ~. We could do the same for the end of the text, but this is a dead end anyway, because the second problem remains.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling it to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
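A sketch of that last suggestion in Python, assuming the three-way lookaround pattern above: the neighbouring characters are read from the input using each match's start and end offsets. The helper name neighbours is hypothetical.

```python
import re

LOOKAROUND = re.compile(
    r"(?<=~)match-this(?!~)"    # tilde before, no tilde after
    r"|(?<!~)match-this(?=~)"   # no tilde before, tilde after
    r"|(?<!~)match-this(?!~)"   # no tilde on either side
)

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"


def neighbours(text, m):
    """Return the characters just before and just after match m,
    or None where the match touches the edge of the text."""
    before = text[m.start() - 1] if m.start() > 0 else None
    after = text[m.end()] if m.end() < len(text) else None
    return before, after


found = [neighbours(s, m) for m in LOOKAROUND.finditer(s)]
print(found)
# [(None, '~'), (' ', ' '), ('~', '#'), ('#', '~'), ('~', None)]
```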

How to match periods not at the end of paragraphs?

If I want to find all periods that ARE at the end of paragraphs, I could do \.($|\n). But how can I negate that and say "a period followed by any character that ISN'T one of these", given that metacharacters don't work inside character classes, which stops me using negated character classes?
What's in a $? It depends!
The answer very much depends on which language and regex engine you're using. You see,
In Java, the $ asserts that we are positioned at the end of the string or before any carriage return or newline at the end of the string. So you'd be safe with a \.(?!$)
In PCRE, C# and Python, the $ asserts that we are positioned at the end of the string or before any newline at the end of the string. So you could use a \.(?!$|\r)
In JavaScript and Ruby, the $ asserts that we are positioned at the end of the string. So you'd need to go the full Monty with a \.(?!$|[\r\n]).
Therefore, for a multi-engine solution, the safest would be:
\.(?!$|[\r\n])
But in the right context, the other two options are perfectly acceptable.
Explanation
\. matches the literal period
The negative lookahead (?!$|[\r\n]) asserts that what follows is neither the "end of the string" nor a carriage return nor a newline.
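A quick way to check the multi-engine pattern is to run it in Python, one of the engines listed above. This is only a sketch with made-up sample text.

```python
import re

# A period that is followed neither by the end of the string
# nor by a carriage return or newline.
NOT_LINE_FINAL = re.compile(r"\.(?!$|[\r\n])")

text = "A. B.\nC. D."
positions = [m.start() for m in NOT_LINE_FINAL.finditer(text)]
print(positions)  # [1, 7]
```

The periods at offsets 4 (before the newline) and 10 (end of the string) are skipped.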
Use a Negative Lookahead to do this.
\.(?!\n|$)
Explanation:
\. '.'
(?! look ahead to see if there is not:
\n '\n' (newline)
| OR
$ before an optional \n, and the end of the string
) end of look-ahead
Live Demo
The most useful longhand version of the negatively looked ahead EOL check after the period winds up making your entire pattern something like this:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for not the following{
\R ? # optional EOL grapheme cluster
\z # at the true end of string
) # } end look ahead
)
That assumes you don’t want it to match “interstitially” (that is, before any line-terminator grapheme), which would be the simpler:
(?=\R)
Some argument can be made for that \R? being made into a \R* instead, in case you should happen to have multiple line-terminators at the end of a record, like several newlines in a row. That way 0, 1, 2, or however many EOL graphemes are allowed before the end of the string.
On the other hand, it may well be the case that a paragraph must be at least two EOL graphemes, not just one alone. For example, this is true in markup here and in other files with “blank-line separated” types of paragraphs. So no EOLs are ok, and two or more are too, but not just one of them.
For such text, you would need \R{2,}, but the whole bit would be optionalized, yielding in that case:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for NOT the following {
(?:
\R {2,} # two or more EOL grapheme clusters
) ? # optionally
\z # at the true end of string
) # } end negated look ahead
)
If you don’t have \R from UTS 18: Unicode Regular Expressions — Line Boundaries in your regex flavor, then you will have to write it out the hard way, which is the rather annoying:
(?x: # We are emulating \R per UTS#18
(?> # Prohibit backtrack within subpattern
\r \n # Match a CRLF without backtracking
# or else any code point with the
# vertical space character property
# \p{VertSpace}, here enumerated in full
| [\x0A-\x0D\x85\x{2028}\x{2029}]
)
)
You need the no-backtracking bit to avoid something like \R{2} being allowed to match a single CRLF, and it isn’t allowed to do that.
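Python's re module has no \R, and atomic groups are only available in recent versions, so the sketch below swaps in a different technique: a lone \r is accepted only when no \n follows, which also keeps a single CRLF from being counted as two line terminators. The names EOL and inner_periods are assumptions made for this example.

```python
import re

# One line terminator: CRLF as a unit, a lone CR only when no LF follows,
# or one of the other vertical-space code points.
EOL = r"(?:\r\n|\r(?!\n)|[\n\x0b\x0c\x85\u2028\u2029])"

# A period NOT followed by (optionally two or more EOLs and) the true end
# of the string; \Z is Python's equivalent of \z.
PERIOD_NOT_AT_END = re.compile(r"\.(?!(?:" + EOL + r"{2,})?\Z)")


def inner_periods(text):
    """Offsets of periods that do not close the final paragraph."""
    return [m.start() for m in PERIOD_NOT_AT_END.finditer(text)]


print(re.fullmatch(EOL + "{2}", "\r\n"))   # None: one CRLF is not two EOLs
print(inner_periods("One. Two.\n\n"))      # [3]
print(inner_periods("End.\r\n"))           # [3]: a single EOL does not count
```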
One final thing to consider is whether you want to allow for optional horizontal whitespace to intervene between the period and the EOL. I rather imagine that you do, but without a tighter formal specification in the OP, it’s impossible to say so definitely.
You should use a negative lookahead.
\.(?!$|\n)
More on this: http://www.regular-expressions.info/lookaround.html

Perl: Matching string not containing PATTERN

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:
/^(?:(?!PATTERN).)*$/; # Matches strings not containing PATTERN
Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.
Regular expression to match a line that doesn't contain a word? helps a lot in understanding, but why is the . in my example and the ?: and how do the outer parentheses work?
Can someone break up the regex and explain in simple words how it works?
Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
About newlines: there are two problems with your regex and newlines. First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches. Second, $ doesn't match just at the end of the string, it also can match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to only match strings with no whitespace; \z always only matches at the end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s. So the correct general purpose regex is:
/^(?:(?!PATTERN).)*\z/s
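The same construction can be tried in Python, where re.DOTALL corresponds to /s and \Z to \z. The helper name lacks is hypothetical, and the sketch simply replays the examples from this answer.

```python
import re


def lacks(pattern, text):
    """True if pattern matches nowhere in text (hypothetical helper)."""
    tempered = r"^(?:(?!" + pattern + r").)*\Z"
    return re.search(tempered, text, re.DOTALL) is not None


print(lacks("baz", "foo\nbar"))  # True: no baz, even across the newline
print(lacks(r"\s", "foo\n"))     # False: the trailing newline is whitespace

# As part of a larger match: foo, later baz, but no bar in between.
between = re.compile(r"foo(?:(?!bar).)*baz")
print(between.search("foo bar baz") is None)      # True
print(between.search("foo bar foo baz").group())  # foo baz

# The shorter whole-string check from above.
print(re.search(r"^(?!.*baz)", "foo\nbaz", re.DOTALL) is None)  # True
```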
jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *
What does this group do? It contains
a negative lookahead (you may want to read up on lookarounds here) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.

Regex.Replace formatting a query

I am working in VB.Net and trying to use Regex.Replace to format a string I am using to query SQL. What I'm going for is to cut out "--" comments. I've found that in most cases the below works for what I need.
string = Regex.Replace(command, "--.*\n", "")
and
string = Regex.Replace(command, "--.*$", "")
However, I have run into a problem. If a string inside my query contains the double dash, it doesn't work: the replace just cuts out the rest of the line starting at the double dash. It makes sense to me why, but I can't figure out the regular expression I need to match on.
Logically I need to match on a string that starts with "--" and is not proceeded by "'" and not followed by "'" with any number of characters in between. But I'm not sure how to express that in a regular expression. I have tried variations of:
string = Regex.Replace(cmd, "[^('.*)]--.*\n[^(.*')]", "")
Which I know is obviously wrong. I have looked at a couple of online resources including http://www.codeproject.com/KB/dotnet/regextutorial.aspx
but due to my lack of understanding I can't seem to figure this one out.
I think you meant "match on a string that starts with -- and is not preceded by ' and not followed by ' with any number of characters in between"
If so, then this is what you are looking for:
string = Regex.Replace(cmd, "(?<!'.*?--)--(?!.*?').*(?=\r\n)", "")
'EDIT: modified a little
Of course, it means you can't have apostrophes in your comments... and would be exceedingly easy to hack if someone wanted to (you aren't thinking of using this to protect against injection attacks, are you? ARE YOU!??! :D )
I can break down the expression if you'd like, but it's essentially the same as my modified quote above!
EDIT:
I modified the expression a little, so it does not consume any carriage return, only the comment itself... the expression says:
(?<! # negative lookbehind assertion*
' # match a literal single quote
.*? # followed by anything (reluctantly*)
-- # two literal dashes
) # end assertion
-- # match two literal dashes
(?! # negative lookahead assertion
.*? # match anything (reluctant)
' # followed by a literal single quote
) # end assertion
.* # match anything
(?= # positive lookahead assertion
\r\n # match carriage-return, line-feed
) # end assertion
negative lookbehind assertion means at this point in the match, look backward here and assert that this cannot be matched
negative lookahead assertion means look forward from this point and assert this cannot be matched
positive lookahead asserts the following expression CAN be matched
reluctant means only consume a match for the previous atom (the . which means everything in this case) if you cannot match the expression that follows. Thus the .*? in .*?-- (when applied against the string abc--) will consume a, then check to see if the -- can be matched and fail; it will then consume ab, but stop again to see if the -- can be matched and fail; once it consumes abc and the -- can be matched (success), it will finally consume the entire abc--
non-reluctant or "greedy" which would be .* without the ? will match abc-- with the .*, then try to match the end of the string with -- and fail; it will then backtrack until it can match the --
one additional note is that the . "anything" does not by default include newlines (carriage-return/line-feed), which is needed for this to work properly (there is a switch that will allow . to match newlines and it will break this expression)
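The reluctant versus greedy behaviour is easy to observe directly. A small Python sketch, with a sample string made up for the purpose:

```python
import re

text = "abc--def--"

# Reluctant: stop at the first place where -- can follow.
lazy = re.match(r".*?--", text).group()

# Greedy: grab everything, then backtrack to the last possible --.
greedy = re.match(r".*--", text).group()

print(lazy)    # abc--
print(greedy)  # abc--def--
```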
A good resource - where I've learned 90% of what I know about regex - is Regular-Expressions.info
Tread carefully and good luck!
OK what you are doing here is not right :
/[^('.*)]--.*\n[^(.*')]/
You are saying the following :
Match one character that is not (, ), ', . or *, then match --, then match anything up to a newline, and then match one character that is not in the same character class as the one at the start.
What you probably meant to do is this :
/(?<!['"])\s*--.*[\r\n]*/
Which says: make sure the match is not preceded by a ' or ", then match any whitespace, then --, then anything else up to the end of the line, along with any carriage return or line feed characters that follow.
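A different technique from the lookaround patterns in these answers, shown here only as a hedged sketch in Python: match single-quoted string literals first so they are skipped, and capture a comment only when it starts outside one. The function name strip_line_comments is hypothetical, and the sketch assumes quotes are balanced and ignores other SQL quoting and comment styles.

```python
import re

# Alternative 1 consumes a quoted literal untouched; alternative 2 captures
# a -- comment up to (not including) the line break.
SQL_TOKEN = re.compile(r"'[^']*'|(--[^\r\n]*)")


def strip_line_comments(sql):
    """Remove -- comments, leaving -- inside quoted literals alone."""
    def replace(m):
        # Group 1 is set only when the comment alternative matched.
        return "" if m.group(1) is not None else m.group(0)
    return SQL_TOKEN.sub(replace, sql)


query = "SELECT '--x' AS a -- note\nFROM t"
print(repr(strip_line_comments(query)))  # "SELECT '--x' AS a \nFROM t"
```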

Regex: what does (?! ...) mean?

The following regex finds text between substrings FTW and ODP.
/FTW(((?!FTW|ODP).)+)ODP+/
What does the (?!...) do?
(?!regex) is a zero-width negative lookahead. It will test the characters at the current cursor position and forward, testing that they do NOT match the supplied regex, and then return the cursor back to where it started.
The whole regexp:
/
FTW # Match Characters 'FTW'
( # Start Match Group 1
( # Start Match Group 2
(?!FTW|ODP) # Ensure next characters are NOT 'FTW' or 'ODP', without matching
. # Match one character
)+ # End Match Group 2, Match One or More times
) # End Match Group 1
OD # Match characters 'OD'
P+ # Match 'P' One or More times
/
So - Hunt for FTW, then capture while looking for ODP+ to end our string. Also ensure that the data between FTW and ODP+ doesn't contain FTW or ODP
From perldoc:
A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/ . Sometimes it's still easier just to say:
if (/bar/ && $` !~ /foo$/)
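The difference is easy to see by running the patterns from the quoted passage. The sketch below uses Python's re; the (?<!foo) line is an addition that relies on the negative lookbehind syntax, which is not part of the quoted perldoc text.

```python
import re

# Lookahead in front of bar only inspects bar itself, so foobar still matches.
print(re.search(r"(?!foo)bar", "foobar") is not None)  # True

# The workaround from the quoted passage.
workaround = re.compile(r"(?:(?!foo)...|^.{0,2})bar")
print(workaround.search("foobar") is None)    # True
print(workaround.search("xbar") is not None)  # True

# With a negative lookbehind, where the engine supports it.
print(re.search(r"(?<!foo)bar", "foobar") is None)  # True
```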
It means "not followed by...". Technically this is what's called a negative lookahead, in that you can peek at what's ahead in the string without consuming it. It is a kind of zero-width assertion, meaning that such expressions don't consume any part of the string.
The programmer must have been typing too fast. Some characters in the pattern got flipped. Corrected:
/WTF(((?!WTF|ODP).)+)ODP+/
Regex
/FTW(((?!FTW|ODP).)+)ODP+/
matches first FTW immediately followed neither by FTW nor by ODP, then all following chars up to the first ODP (but if there is FTW somewhere in them there will be no match) then all the letters P that follow.
So in the string:
FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj
it will match the part marked with brackets:
FTWFTWODPFTWjj[FTWjjODPPPP]jjODPPPjjj
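The example can be reproduced with a short Python sketch using the same pattern and string:

```python
import re

s = "FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj"
m = re.search(r"FTW(((?!FTW|ODP).)+)ODP+", s)

print(m.group(0))  # FTWjjODPPPP
print(m.group(1))  # jj
print(m.span())    # (14, 25)
```

The first three FTW occurrences fail because FTW or ODP starts before an ODP can close the match, so the search only succeeds at offset 14.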
'?!' is actually part of '(?! ... )', it means whatever is inside must NOT match at that location.