How to match periods not at the end of paragraphs? - regex

If I want to find all periods that ARE at the end of paragraphs, I could do \.($|\n). But how can I negate that and say "a period followed by any character that ISN'T one of these", given that metacharacters don't work inside character classes, which stops me from using negated character classes?

What's in a $? It depends!
The answer very much depends on which language and regex engine you're using:
In Java, the $ asserts that we are positioned at the end of the string or before any carriage return or newline at the end of the string. So you'd be safe with \.(?!$)
In PCRE, C# and Python, the $ asserts that we are positioned at the end of the string or before any newline at the end of the string. So you could use \.(?!$|\r)
In JavaScript and Ruby, the $ asserts that we are positioned at the end of the string. So you'd need to go the full monty with \.(?!$|[\r\n]).
Therefore, for a multi-engine solution, the safest would be:
\.(?!$|[\r\n])
But in the right context, the other two options are perfectly acceptable.
Explanation
\. matches the literal period
The negative lookahead (?!$|[\r\n]) asserts that what follows is neither the "end of the string" nor a carriage return nor a newline.
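As a rough illustration, here is how the multi-engine pattern behaves in Python's re module. The sample text and the helper name mid_paragraph_periods are made up for this sketch; they are not from the question.

```python
import re

# A literal period that is followed neither by the end of the string
# nor by a carriage return or newline.
NOT_AT_LINE_END = re.compile(r"\.(?!$|[\r\n])")


def mid_paragraph_periods(text):
    """Return the offsets of periods that do not end a line or the string."""
    return [m.start() for m in NOT_AT_LINE_END.finditer(text)]


sample = "One. Two.\nThree. Four."
positions = mid_paragraph_periods(sample)
print(positions)  # [3, 15]
```

The periods at offsets 8 (before the newline) and 21 (end of the string) are skipped.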

Use a Negative Lookahead to do this.
\.(?!\n|$)
Explanation:
\. '.'
(?! look ahead to see if there is not:
\n '\n' (newline)
| OR
$ before an optional \n, and the end of the string
) end of look-ahead

The most useful longhand version of the negatively looked ahead EOL check after the period winds up making your entire pattern something like this:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for not the following{
\R ? # optional EOL grapheme cluster
\z # at the true end of string
) # } end look ahead
)
That assumes you don’t want it to match “interstitially” (that is, before any line-terminator grapheme), which would be the simpler:
(?=\R)
Some argument can be made for that \R? being made into a \R* instead, in case you should happen to have multiple line-terminators at the end of a record, like several newlines in a row. That way 0, 1, 2, or however many EOL graphemes are allowed before the end of the string.
On the other hand, it may well be the case that a paragraph must be at least two EOL graphemes, not just one alone. For example, this is true in markup here and in other files with “blank-line separated” types of paragraphs. So no EOLs are ok, and two or more are too, but not just one of them.
For such text, you would need \R{2,}, but the whole bit would be optionalized, yielding in that case:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for NOT the following {
(?:
\R {2,} # two or more EOL grapheme clusters
) ? # optionally
\z # at the true end of string
) # } end negated look ahead
)
If you don’t have \R from UTS 18: Unicode Regular Expressions — Line Boundaries in your regex flavor, then you will have to write it out the hard way, which is the rather annoying:
(?x: # We are emulating \R per UTS#18
(?> # Prohibit backtrack within subpattern
\r \n # Match a CRLF without backtracking
# or else any code point with the
# vertical space character property
# \p{VertSpace}, here enumerated in full
| [\x0A-\x0D\x85\x{2028}\x{2029}]
)
)
You need the no-backtracking bit to avoid something like \R{2} being allowed to match a single CRLF, and it isn’t allowed to do that.
One final thing to consider is whether you want to allow for optional horizontal whitespace to intervene between the period and the EOL. I rather imagine that you do, but without a tighter formal specification in the OP, it’s impossible to say so definitely.
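For what it's worth, here is a sketch of the "two or more EOLs, optionally, then the true end of string" variant in Python. Python's re module has neither \R nor \z, so this assumes \Z (Python's true end of string) and a hand-written EOL alternation. Instead of an atomic group it uses \r(?!\n), a swapped-in trick that keeps a lone \r from matching when it is really the first half of a CRLF. The names EOL and periods_inside_record are hypothetical.

```python
import re

# Emulation of \R: a CRLF pair as one unit, a lone CR that is not part of a
# CRLF, or any other vertical-space code point. The (?!\n) guard stands in
# for the atomic group, so a single CRLF can never count as two EOLs.
EOL = r"(?:\r\n|\r(?!\n)|[\n\x0b\x0c\x85\u2028\u2029])"

# A period NOT followed by (optionally two or more EOLs and) the end of string.
PERIOD_INSIDE_RECORD = re.compile(r"\.(?!(?:" + EOL + r"{2,})?\Z)")


def periods_inside_record(text):
    """Offsets of periods that are not the paragraph-final period of the record."""
    return [m.start() for m in PERIOD_INSIDE_RECORD.finditer(text)]


print(periods_inside_record("A. B."))            # [1]
print(periods_inside_record("End.\r\n\r\n"))     # []
print(periods_inside_record("End.\r\n"))         # [3]
```

The last call shows the point of the no-backtracking remark: one CRLF is a single EOL, so it does not satisfy "two or more" and the period is not treated as paragraph-final.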

You should use a negative lookahead.
\.(?!$|\n)
More on this: http://www.regular-expressions.info/lookaround.html

Related

regex matching whole with a few criteria

I want to match the whole line. The purpose is this: in Python, I will list all the files in a directory, then I want to pick those file URLs that contain certain keywords, i.e. 'qwert2asdf' and 'windows'.
My current regex:
[a-zA-Z0-9_.\-\\]*(qwert2asdf)[a-zA-Z0-9_.\-\\]*(windows)[a-zA-Z0-9_.\-\\]*
matches line #4 which is what I need
4\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
question 1. Is there a better way, so I don't have to repeat [a-zA-Z0-9_.\-\\]*?
question 2. How do I make the match ignore the order of 'qwert2asdf' and 'windows', so that it still matches if 'windows' happens before 'qwert2asdf'?
1\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_boxt_pkg_isys.abcdefg_urururur_20140701_1815.linux.tar.gz
2\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\bbb_pkg_all_systems.abcdefg_urururur_20140701_1815.tar.gz
3\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
4\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
5\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
6\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
7\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
8\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
9\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_system.abcdefg_urururur_20140701_1815.tar.gz
10\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\doc_pkg_evih_iii_ass_system.abcdefg_urururur_20140701_1815.tar.gz
11\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815.linux.tar.gz
12\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815.windows.tar.gz
13\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815_vp.tar.gz
14\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_pkg_css_skm_cgdsg0_system.abcdefg_urururur_20140701_1815.tar.gz
15\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_pkg_css_skm_asdfgt_system.abcdefg_urururur_20140701_1815.tar.gz
16\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
17\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
18\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
19\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.linux.tar.gz
20\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.windows.tar.gz
21\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\doc_pkg_ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.tar.gz
You can use Positive Lookahead here.
^(?=.*qwert2asdf)(?=.*windows)[\w\\.-]*$
Explanation:
^ # the beginning of the string
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times)
qwert2asdf # 'qwert2asdf'
) # end of look-ahead
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times)
windows # 'windows'
) # end of look-ahead
[\w\\.-]* # any character of: word characters (a-z, A-Z, 0-9, _),
# '\\', '.', '-' (0 or more times)
$ # before an optional \n, and the end of the string
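Since the question mentions Python, here is a minimal sketch of using this pattern to filter a list of paths. The paths and the helper name pick_files are invented stand-ins, much shorter than the real ones in the question.

```python
import re

# Both keywords must appear somewhere in the line, in any order, and the
# whole line must consist of word characters, backslashes, dots and hyphens.
BOTH_KEYWORDS = re.compile(r"^(?=.*qwert2asdf)(?=.*windows)[\w\\.-]*$")


def pick_files(paths):
    """Keep only the paths that contain both keywords."""
    return [p for p in paths if BOTH_KEYWORDS.match(p)]


# Hypothetical sample paths.
paths = [
    r"\\host\share\pkg_qwert2asdf_sys.windows.tar.gz",
    r"\\host\share\windows_pkg_qwert2asdf.tar.gz",
    r"\\host\share\pkg_qwert2asdf_sys.linux.tar.gz",
    r"\\host\share\pkg_sys.windows.tar.gz",
]
picked = pick_files(paths)
print(picked)
```

Only the first two paths are kept, because each lookahead is checked independently from the start of the line.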
This should work:
(.*?)windows(.*)
You did not say which regex flavor you are using (POSIX, Perl, Java, ...), but I am unaware of any that has a way to write a pattern that matches the same set of strings as yours without repeating the character class as yours does.
You might be tempted to look at back references, but they do not do what you want.
Depending on the host language, however, you might be able to reduce duplication by putting the text of the character class into a variable, and interpolating the variable into your regular expression at each of the three points.
Matching regardless of the order of the 'qwert2asdf' and 'windows' substrings is messier, but it can be done. Here's one way that should work in pretty much any regex engine, modulo any metacharacter (non-)escaping that might need to be performed:
[a-zA-Z0-9_.\-\\]*((qwert2asdf)[a-zA-Z0-9_.\-\\]*(windows)|(windows)[a-zA-Z0-9_.\-\\]*(qwert2asdf))[a-zA-Z0-9_.\-\\]*
A regex engine that supports zero-width lookbehind assertions would provide other alternatives, but I don't think any would come out shorter.
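To make the "put the character class in a variable" suggestion concrete, here is a sketch in Python, assuming that is the host language. PATH_CHARS, EITHER_ORDER and has_both_keywords are names made up for the example, and the inner groups are non-capturing because nothing here needs the captures.

```python
import re

# The repeated character class lives in one variable and is interpolated
# wherever the original pattern spelled it out.
PATH_CHARS = r"[a-zA-Z0-9_.\-\\]*"

# Either keyword order is accepted by listing both orders as alternatives.
EITHER_ORDER = re.compile(
    f"{PATH_CHARS}"
    f"(?:qwert2asdf{PATH_CHARS}windows|windows{PATH_CHARS}qwert2asdf)"
    f"{PATH_CHARS}"
)


def has_both_keywords(path):
    """True if the whole path matches and contains both keywords in any order."""
    return EITHER_ORDER.fullmatch(path) is not None


print(has_both_keywords(r"\\host\share\pkg_qwert2asdf.windows.tar.gz"))  # True
print(has_both_keywords(r"\\host\share\windows_pkg_qwert2asdf.tar.gz"))  # True
print(has_both_keywords(r"\\host\share\pkg_qwert2asdf.linux.tar.gz"))    # False
```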

General approach for (equivalent of) "backreferences within character class"?

In Perl regexes, expressions like \1, \2, etc. are usually interpreted as "backreferences" to previously captured groups, but not so when the \1, \2, etc. appear within a character class. In the latter case, the \ is treated as an escape character (and therefore \1 is just 1, etc.).
Therefore, if (for example) one wanted to match a string (of length greater than 1) whose first character matches its last character, but does not appear anywhere else in the string, the following regex will not do:
/\A # match beginning of string;
(.) # match and capture first character (referred to subsequently by \1);
[^\1]* # (WRONG) match zero or more characters different from character in \1;
\1 # match \1;
\z # match the end of the string;
/sx # s: let . match newline; x: ignore whitespace, allow comments
This does not work, since it also matches (for example) the string 'a1a2a':
DB<1> ( 'a1a2a' =~ /\A(.)[^\1]*\1\z/ and print "fail!" ) or print "success!"
fail!
I can usually manage to find some workaround (see note 1 below), but it's always rather problem-specific, and usually far more complicated-looking than what I would do if I could use backreferences within a character class.
Is there a general (and hopefully straightforward) workaround?
1 For example, for the problem in the example above, I'd use something like
/\A
(.) # match and capture first character (referred to subsequently
# by \1);
(?!.*\1.+\z) # a negative lookahead assertion for "a suffix containing \1";
.* # substring not containing \1 (as guaranteed by the preceding
# negative lookahead assertion);
\1\z # match last character only if it is equal to the first one
/sx
...where I've replaced the reasonably straightforward (though, alas, incorrect) subexpression [^\1]* in the earlier regex with the somewhat more forbidding negative lookahead assertion (?!.*\1.+\z). This assertion basically says "give up if \1 appears anywhere beyond this point (other than at the last position)." Incidentally, I give this solution just to illustrate the sort of workarounds I referred to in the question. I don't claim that it is a particularly good one.
This can be accomplished with a negative lookahead within a repeated group:
/\A # match beginning of string;
(.) # match and capture first character (referred to subsequently by \1);
((?!\1).)* # match zero or more characters different from character in \1;
\1 # match \1;
\z # match the end of the string;
/sx
This pattern can be used even if the group contains more than one character.
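The same idea carries over to other engines. Here is a sketch in Python, assuming \Z in place of Perl's \z and re.DOTALL in place of /s; the helper name first_equals_last_only is made up.

```python
import re

# (?:(?!\1).)* consumes one character at a time, each time first checking
# that the captured first character does not start at that position.
FIRST_EQUALS_LAST_ONLY = re.compile(r"\A(.)(?:(?!\1).)*\1\Z", re.DOTALL)


def first_equals_last_only(s):
    """True if the first character reappears only as the last character."""
    return FIRST_EQUALS_LAST_ONLY.match(s) is not None


print(first_equals_last_only("abca"))   # True
print(first_equals_last_only("a1a2a"))  # False
print(first_equals_last_only("a"))      # False
```

The string 'a1a2a', which the [^\1]* version wrongly accepted, is rejected here.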

Regex.Replace formatting a query

I am working in VB.Net and trying to use Regex.Replace to format a string I am using to query SQL. What I'm going for is to cut out "--" comments. I've found that in most cases the below works for what I need.
string = Regex.Replace(command, "--.*\n", "")
and
string = Regex.Replace(command, "--.*$", "")
However, I have run into a problem. If a string inside my query contains the double dash, it doesn't work: the replace just cuts out the rest of the line starting at the double dash. It makes sense to me why, but I can't figure out the regular expression I need to match on.
Logically I need to match on a string that starts with "--" and is not preceded by "'" and not followed by "'" with any number of characters in between. But I'm not sure how to express that in a regular expression. I have tried variations of:
string = Regex.Replace(cmd, "[^('.*)]--.*\n[^(.*')]", "")
Which I know is obviously wrong. I have looked at a couple of online resources including http://www.codeproject.com/KB/dotnet/regextutorial.aspx
but due to my lack of understanding I can't seem to figure this one out.
I think you meant "match on a string that starts with -- and is not preceded by ' and not followed by ' with any number of characters in between"
If so, then this is what you are looking for:
string = Regex.Replace(cmd, "(?<!'.*?--)--(?!.*?').*(?=\r\n)", "")
'EDIT: modified a little
Of course, it means you can't have apostrophes in your comments... and would be exceedingly easy to hack if someone wanted to (you aren't thinking of using this to protect against injection attacks, are you? ARE YOU!??! :D )
I can break down the expression if you'd like, but it's essentially the same as my modified quote above!
EDIT:
I modified the expression a little, so it does not consume any carriage return, only the comment itself... the expression says:
(?<! # negative lookbehind assertion*
' # match a literal single quote
.*? # followed by anything (reluctantly*)
-- # two literal dashes
) # end assertion
-- # match two literal dashes
(?! # negative lookahead assertion
.*? # match anything (reluctant)
' # followed by a literal single quote
) # end assertion
.* # match anything
(?= # positive lookahead assertion
\r\n # match carriage-return, line-feed
) # end assertion
negative lookbehind assertion means at this point in the match, look backward here and assert that this cannot be matched
negative lookahead assertion means look forward from this point and assert this cannot be matched
positive lookahead asserts the following expression CAN be matched
reluctant means only consume a match for the previous atom (the . which means everything in this case) if you cannot match the expression that follows. Thus the .*? in .*?-- (when applied against the string abc--) will consume a, then check to see if the -- can be matched and fail; it will then consume ab, but stop again to see if the -- can be matched and fail; once it consumes abc and the -- can be matched (success), it will finally consume the entire abc--
non-reluctant or "greedy" which would be .* without the ? will match abc-- with the .*, then try to match the end of the string with -- and fail; it will then backtrack until it can match the --
one additional note is that the . "anything" does not by default include newlines (carriage-return/line-feed), which is needed for this to work properly (there is a switch that will allow . to match newlines and it will break this expression)
A good resource - where I've learned 90% of what I know about regex - is Regular-Expressions.info
Tread carefully and good luck!
OK what you are doing here is not right :
/[^('.*)]--.*\n[^(.*')]/
You are saying the following :
Match one character that is not (, ), ', ., or *, then match --, then match anything up to a newline, then match one character that is not in that same set.
What you probably meant to do is this :
/(?<!['"])\s*--.*[\r\n]*/
Which says: make sure the match is not preceded by a ' or ", then match any whitespace, then --, and then anything else up to and including any trailing carriage return or line feed characters.
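A different technique from the lookaround answers above, sketched in Python: match either a whole quoted string or a comment, and use a callback to keep the former and drop the latter. This assumes SQL strings use single quotes with '' as the escaped quote, and it ignores /* */ comments and double-quoted identifiers. The name strip_sql_comments is hypothetical.

```python
import re

# Alternative 1 (captured): a complete single-quoted SQL string, where ''
# inside the string stands for an escaped quote.
# Alternative 2 (not captured): a -- comment up to, but not including,
# the line terminator.
STRING_OR_COMMENT = re.compile(r"('(?:[^']|'')*')|--[^\r\n]*")


def strip_sql_comments(sql):
    """Remove -- comments while leaving -- inside quoted strings alone."""
    # group(1) is None when the comment alternative matched.
    return STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", sql)


query = "SELECT '--not a comment' -- real comment\nFROM t"
print(repr(strip_sql_comments(query)))
# "SELECT '--not a comment' \nFROM t"
```

Because the quoted string is consumed as a unit, a -- inside it is never seen as the start of a comment, and apostrophes inside comments do no harm either.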

Nested regex lookahead and lookbehind

I am having problems with the nested '+'/'-' lookahead/lookbehind in regex.
Let's say that I want to change the '*' in a string with '%' and let's say that '\' escapes the next character. (Turning a regex to sql like command ^^).
So the string
'*test*' should be changed to '%test%',
'\\*test\\*' -> '\\%test\\%', but
'\*test\*' and '\\\*test\\\*' should stay the same.
I tried:
(?<!\\)(?=\\\\)*\* but this doesn't work
(?<!\\)((?=\\\\)*\*) ...
(?<!\\(?=\\\\)*)\* ...
(?=(?<!\\)(?=\\\\)*)\* ...
What is the correct regex that will match the '*'s in examples given above?
What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\* or if these are essentially wrong the difference between regex that have such a visual construction?
To find an unescaped character, you would look for a character that is preceded by an even number of (or zero) escape characters. This is relatively straight-forward.
(?<=(?<!\\)(?:\\\\)*)\* # this is explained in Tim Pietzcker's answer
Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:
(?=(?<!\\)(?:\\\\)*\*)(\\*)\* # also look at ridgerunner's improved version
Replace this with the contents of group 1 and a % sign.
Explanation
(?= # start look-ahead
(?<!\\) # a position not preceded by a backslash (via look-behind)
(?:\\\\)* # an even number of backslashes (don't capture them)
\* # a star
) # end look-ahead. If found,
( # start group 1
\\* # match any number of backslashes in front of the star
) # end group 1
\* # match the star itself
The look-ahead makes sure only even numbers of backslashes are taken into account. Anyway, there is no way around matching them into a group, since the look-ahead does not advance the position in the string.
Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:
Replace: ((?<!\\)(?:\\\\)*)\* with $1%
Here it is in the form of a commented PHP snippet:
// Replace all non-escaped asterisks with "%".
$re = '% # Match non-escaped asterisks.
( # $1: Any/all preceding escaped backslashes.
(?<!\\\\) # At a position not preceded by a backslash,
(?:\\\\\\\\)* # Match zero or more escaped backslashes.
) # End $1: Any preceding escaped backslashes.
\* # Unescaped literal asterisk.
%x';
$text = preg_replace($re, '$1%', $text);
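The same replacement appears to port directly to Python, whose re module supports fixed-width lookbehind. A sketch, with a made-up function name, run against the four examples from the question:

```python
import re

# Group 1: any run of escaped backslashes (pairs) that starts at a position
# not preceded by a backslash; then the unescaped asterisk itself.
UNESCAPED_STAR = re.compile(r"((?<!\\)(?:\\\\)*)\*")


def unescaped_stars_to_percent(text):
    """Replace each unescaped * with %, keeping the preceding backslash pairs."""
    return UNESCAPED_STAR.sub(r"\1%", text)


print(unescaped_stars_to_percent("*test*"))           # %test%
print(unescaped_stars_to_percent(r"\\*test\\*"))      # \\%test\\%
print(unescaped_stars_to_percent(r"\*test\*"))        # \*test\*
print(unescaped_stars_to_percent(r"\\\*test\\\*"))    # \\\*test\\\*
```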
Addendum: Non-lookaround JavaScript Solution
The above solution does require lookbehind, so it will not work in JavaScript. The following JavaScript solution does not use lookbehind:
text = text.replace(/(\\[\S\s])|\*/g,
function(m0, m1) {
return m1 ? m1 : '%';
});
This solution replaces each instance of backslash-anything with itself, and each instance of * asterisk with a % percent sign.
Edit 2011-10-24: Fixed Javascript version to correctly handle cases such as: **text**. (Thanks to Alan Moore for pointing out the error in previous version.)
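For comparison, the callback approach translates to Python almost word for word. This is only a sketch, and the function name is invented:

```python
import re

# Either a backslash followed by any character (captured, kept as-is),
# or a bare asterisk (not captured, replaced by %).
ESCAPED_OR_STAR = re.compile(r"(\\[\S\s])|\*")


def stars_to_percent_no_lookbehind(text):
    """Replace unescaped * with % without using any lookaround."""
    return ESCAPED_OR_STAR.sub(lambda m: m.group(1) or "%", text)


print(stars_to_percent_no_lookbehind("**text**"))       # %%text%%
print(stars_to_percent_no_lookbehind(r"\\*test\\*"))    # \\%test\\%
print(stars_to_percent_no_lookbehind(r"\*test\*"))      # \*test\*
```

Because every backslash is consumed together with the character it escapes, an asterisk is only ever seen on its own when it is unescaped.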
Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution (demo here):
s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;
The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.
The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.
I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?
Search for
(?<=(?<!\\)(?:\\\\)*)\*
and replace with %.
Explanation:
(?<= # Assert that it's possible to match before the current position...
(?<!\\) # (unless there are more backslashes before that)
(?:\\\\)* # an even number of backslashes
) # End of lookbehind
\* # Then match an asterisk
The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:
Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).
Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).
The result is a short, simple pattern that doesn't need lookarounds or callbacks, and it's shorter than anything else I see so far.
/(?<!\\)((?:\\.)*)\*/g
And the replacement string:
"$1%"
This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:
var regx = /(^|[^\\])((?:\\.)*)\*/g;
while (input.match(regx)) {
input = input.replace(regx, '$1$2%');
}
There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.

Regex that says what NOT to match?

I’m wondering how to match any characters except for a particular string (call it "for") in a regex.
I was thinking maybe it was something like this: [^for]* — except that that doesn’t work.
I’m sure this is a dup.
One way is to start your pattern with a lookahead like this:
(?=\A(?s:(?!for).)*\z)
That can be written like this in any regex system worth bothering with:
(?x) # turn /x mode on for commentary & spacing
(?= # lookahead assertion; hence nonconsumptive
\A # beginning of string
(?s: # begin non-capturing group for later quantification
# enable /s mode so dot can cross lines
(?! for ) # lookahead negation: ain't no "for" here
. # but there is any one single code point
) # end of "for"-negated anything-dot
* # repeat that group zero or more times, greedily
\z # until we reach the very end of the string
) # end of lookahead
Now just put that in the front of your pattern, and add whatever else you’d like afterwards. That’s how you express the logic !/for/ && ⋯ when you have to build such knowledge into the pattern.
It is similar to how you construct /foo/ && /bar/ && /glarch/ when you have to put it in a single pattern, which is
(?=\A(?s:.)*foo)(?=\A(?s:.)*bar)(?=\A(?s:.)*glarch)
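Both lookahead prefixes can be tried in Python as a quick check. This sketch assumes Python 3.6 or later for the scoped (?s:...) flag and uses \Z where the pattern above has \z, since Python's re has no \z. The pattern names are made up.

```python
import re

# Equivalent of !/for/: no position in the string starts the text "for".
NO_FOR = re.compile(r"(?=\A(?s:(?!for).)*\Z)")

# Equivalent of /foo/ && /bar/ && /glarch/, in any order, across lines.
ALL_THREE = re.compile(r"(?=\A(?s:.)*foo)(?=\A(?s:.)*bar)(?=\A(?s:.)*glarch)")

print(bool(NO_FOR.match("be\nfair")))             # True
print(bool(NO_FOR.match("before")))               # False
print(bool(ALL_THREE.match("bar\nglarch foo")))   # True
print(bool(ALL_THREE.match("foo bar")))           # False
```

Each pattern is only a zero-width prefix; the rest of a real pattern would be appended after it.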
^(?!for$).*$
matches any string except for.
^(?!.*for).*$
matches any string that doesn't contain for.
^(?!.*\bfor\b).*$
matches any string that doesn't contain for as a complete word, but allows words like forceps.
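A quick way to see the difference between the three patterns is to run them side by side. This is a sketch in Python; the constant names and test strings are made up.

```python
import re

NOT_EXACTLY_FOR = re.compile(r"^(?!for$).*$")        # anything but "for" itself
NO_FOR_ANYWHERE = re.compile(r"^(?!.*for).*$")       # no "for" substring
NO_FOR_WORD = re.compile(r"^(?!.*\bfor\b).*$")       # no standalone word "for"

print(bool(NOT_EXACTLY_FOR.match("for")))            # False
print(bool(NOT_EXACTLY_FOR.match("fork")))           # True
print(bool(NO_FOR_ANYWHERE.match("before")))         # False
print(bool(NO_FOR_ANYWHERE.match("after")))          # True
print(bool(NO_FOR_WORD.match("wait for it")))        # False
print(bool(NO_FOR_WORD.match("forceps")))            # True
```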
You can try to check whether the string matches for, and negate the result, in whatever language you use (e.g. if (not $_ =~ m/for/) in Perl)