I’m wondering how to match any characters except for a particular string (call it "for") in a regex.
I was thinking maybe it was something like this: [^for]* — except that that doesn’t work.
I’m sure this is a dup.
One way is to start your pattern with a lookahead like this:
(?=\A(?s:(?!for).)*\z)
That can be written like this in any regex system worth bothering with:
(?x) # turn /x mode on for commentary & spacing
(?= # lookahead assertion; hence nonconsumptive
\A # beginning of string
(?s: # begin non-capturing group for later quantification
# enable /s mode so dot can cross lines
(?! for ) # lookahead negation: ain't no "for" here
. # but there is any one single code point
) # end of "for"-negated anything-dot
* # repeat that group zero or more times, greedily
\z # until we reach the very end of the string
) # end of lookahead
Now just put that in the front of your pattern, and add whatever else you’d like afterwards. That’s how you express the logic !/for/ && ⋯ when you have to build such knowledge into the pattern.
It is similar to how you construct /foo/ && /bar/ && /glarch/ when you have to put it in a single pattern, which is
(?=\A(?s:.)*foo)(?=\A(?s:.)*bar)(?=\A(?s:.)*glarch)
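As a rough illustration, both lookahead idioms above can be tried from Python. This is only a sketch: the helper names are invented for the example, and the end-of-string anchor is written \Z because that is how Python's re module has traditionally spelled the absolute end of string.

```python
import re

# Lookahead asserting that "for" occurs nowhere in the string.
# \Z is used here as Python's spelling of the absolute end-of-string anchor.
NO_FOR = r"(?=\A(?s:(?!for).)*\Z)"

# Lookaheads asserting that all three words occur somewhere, in any order.
ALL_THREE = r"(?=\A(?s:.)*foo)(?=\A(?s:.)*bar)(?=\A(?s:.)*glarch)"


def lacks_for_but_has_digits(text):
    # Hypothetical helper: "no 'for' anywhere AND at least one digit",
    # folded into a single pattern by prefixing the lookahead.
    return re.search(NO_FOR + r"(?s:.)*\d+", text) is not None


def contains_all(text):
    # Hypothetical helper: true when foo, bar and glarch all appear.
    return re.search(ALL_THREE, text) is not None
```

The lookahead consumes nothing, so whatever follows it still starts matching at the beginning of the string.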
^(?!for$).*$
matches any string except for.
^(?!.*for).*$
matches any string that doesn't contain for.
^(?!.*\bfor\b).*$
matches any string that doesn't contain for as a complete word, but allows words like forceps.
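The three patterns can be compared side by side in Python. This is a minimal sketch; the variable names are made up for the example.

```python
import re

# Any string except exactly "for".
not_exactly_for = re.compile(r"^(?!for$).*$")
# Any string that does not contain "for" anywhere.
no_for_anywhere = re.compile(r"^(?!.*for).*$")
# Any string that does not contain "for" as a whole word.
no_for_word = re.compile(r"^(?!.*\bfor\b).*$")

samples = ["for", "fork", "forceps", "wait for it"]
for s in samples:
    print(
        repr(s),
        bool(not_exactly_for.match(s)),
        bool(no_for_anywhere.match(s)),
        bool(no_for_word.match(s)),
    )
```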
You can try to check whether the string matches for, and negate the result, in whatever language you use (e.g. if (not $_ =~ m/for/) in Perl)
(I want to match the whole line. The purpose is this: in Python, I will list all the files in a directory, then pick the file URLs that contain certain keywords, i.e. 'qwert2asdf' and 'windows'):
My current regex:
[a-zA-Z0-9_.\-\\]*(qwert2asdf)[a-zA-Z0-9_.\-\\]*(windows)[a-zA-Z0-9_.\-\\]*
matches line #4 which is what I need
4\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
Question 1: is there a better way, so that I don't have to repeat [a-zA-Z0-9_.\-\\]*?
Question 2: how do I make the match ignore the order of 'qwert2asdf' and 'windows', so that it still matches if 'windows' happens before 'qwert2asdf'?
1\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_boxt_pkg_isys.abcdefg_urururur_20140701_1815.linux.tar.gz
2\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\bbb_pkg_all_systems.abcdefg_urururur_20140701_1815.tar.gz
3\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
4\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
5\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_qwert2asdf_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
6\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
7\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
8\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
9\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_system.abcdefg_urururur_20140701_1815.tar.gz
10\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\doc_pkg_evih_iii_ass_system.abcdefg_urururur_20140701_1815.tar.gz
11\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815.linux.tar.gz
12\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815.windows.tar.gz
13\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_evih_iii_ass_2400_system.abcdefg_urururur_20140701_1815_vp.tar.gz
14\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_pkg_css_skm_cgdsg0_system.abcdefg_urururur_20140701_1815.tar.gz
15\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\css_pkg_css_skm_asdfgt_system.abcdefg_urururur_20140701_1815.tar.gz
16\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.linux.tar.gz
17\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815.windows.tar.gz
18\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ass_bss_sw_boxtppc_evih_iii_ass_2401_system.abcdefg_urururur_20140701_1815_vp.tar.gz
19\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.linux.tar.gz
20\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.windows.tar.gz
21\\abc123-smb.ccabc.com\nfs\site\disks\.abcdefghigk.1234\abcdfff\day.abcdefg_urururur_20140701_1815\nnn-pppp\doc_pkg_ia_css_2.1.3.0.abcdefg_urururur_20140701_1815.tar.gz
You can use Positive Lookahead here.
^(?=.*qwert2asdf)(?=.*windows)[\w\\.-]*$
Explanation:
^ # the beginning of the string
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times)
qwert2asdf # 'qwert2asdf'
) # end of look-ahead
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times)
windows # 'windows'
) # end of look-ahead
[\w\\.-]* # any character of: word characters (a-z, A-Z, 0-9, _),
# '\\', '.', '-' (0 or more times)
$ # before an optional \n, and the end of the string
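Since the question mentions Python, here is a small sketch of how that pattern might be used to filter a list of paths. The sample paths are shortened, made-up stand-ins for the real ones.

```python
import re

# Both keywords must appear somewhere on the line, in either order.
PATTERN = re.compile(r"^(?=.*qwert2asdf)(?=.*windows)[\w\\.-]*$")

# Hypothetical, shortened paths for illustration only.
paths = [
    r"\\host\share\ass_qwert2asdf_system.windows.tar.gz",
    r"\\host\share\windows_build_qwert2asdf.tar.gz",
    r"\\host\share\ass_qwert2asdf_system.linux.tar.gz",
    r"\\host\share\ass_system.windows.tar.gz",
]

# Keep only the paths that contain both keywords.
matches = [p for p in paths if PATTERN.match(p)]
```

Only the first two paths survive the filter, because each lookahead is checked independently from the start of the line.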
This should work:
(.*?)windows(.*)
You did not say which regex flavor you are using (POSIX, Perl, Java, ...), but I am unaware of any that has a way to write a pattern that matches the same set of strings as yours without repeating the character class as yours does.
You might be tempted to look at back references, but they do not do what you want.
Depending on the host language, however, you might be able to reduce duplication by putting the text of the character class into a variable, and interpolating the variable into your regular expression at each of the three points.
Matching regardless of the order of the 'qwert2asdf' and 'windows' substrings is messier, but it can be done. Here's one way that should work in pretty much any regex engine, modulo any metacharacter (non-)escaping that might need to be performed:
[a-zA-Z0-9_.\-\\]*((qwert2asdf)[a-zA-Z0-9_.\-\\]*(windows)|(windows)[a-zA-Z0-9_.\-\\]*(qwert2asdf))[a-zA-Z0-9_.\-\\]*
A regex engine that supports zero-width lookbehind assertions would provide other alternatives, but I don't think any would come out shorter.
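The variable-interpolation idea and the either-order alternation can be sketched in Python, the asker's host language. The names CHARS and EITHER_ORDER are invented for this example.

```python
import re

# The repeated character class, written once and interpolated where needed.
CHARS = r"[a-zA-Z0-9_.\-\\]*"

# Either keyword may come first; the alternation spells out both orders.
EITHER_ORDER = re.compile(
    f"{CHARS}(?:qwert2asdf{CHARS}windows|windows{CHARS}qwert2asdf){CHARS}"
)


def has_both(name):
    # fullmatch requires the whole name to consist of the allowed characters.
    return EITHER_ORDER.fullmatch(name) is not None
```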
If I want to find all periods that ARE at the end of paragraphs, I could do \.($|\n). But how can I negate that and say "a period followed by any character that ISN'T one of these", given that metacharacters don't work inside character classes, which stops me from using negated character classes?
What's in a $? It depends!
The answer very much depends on which language and regex engine you're using. You see,
In Java, the $ asserts that we are positioned at the end of the string or before any carriage return or newline at the end of the string. So you'd be safe with a \.(?!$)
In PCRE, C# and Python, the $ asserts that we are positioned at the end of the string or before any newline at the end of the string. So you could use a \.(?!$|\r)
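In JavaScript (without the m flag), the $ asserts that we are positioned at the end of the string; in Ruby, the $ always matches at the end of every line. To rule out both line-break characters in either engine, you'd need to go the full Monty with a \.(?!$|[\r\n]).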
Therefore, for a multi-engine solution, the safest would be:
\.(?!$|[\r\n])
But in the right context, the other two options are perfectly acceptable.
Explanation
\. matches the literal period
The negative lookahead (?!$|[\r\n]) asserts that what follows is neither the "end of the string" nor a carriage return nor a newline.
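A quick Python check of the multi-engine pattern; the sample text is made up for the example.

```python
import re

# A period that is not followed by the end of the string, \r, or \n.
MID_LINE_PERIOD = re.compile(r"\.(?!$|[\r\n])")

text = "One. Two.\nThree. Four."

# Offsets of the periods that are not at the end of a line or of the string.
positions = [m.start() for m in MID_LINE_PERIOD.finditer(text)]
print(positions)  # [3, 15]
```

The periods after "Two" and "Four" are skipped because one is followed by a newline and the other by the end of the string.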
Use a Negative Lookahead to do this.
\.(?!\n|$)
Explanation:
\. '.'
(?! look ahead to see if there is not:
\n '\n' (newline)
| OR
$ before an optional \n, and the end of the string
) end of look-ahead
The most useful longhand version of the negatively looked ahead EOL check after the period winds up making your entire pattern something like this:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for not the following{
\R ? # optional EOL grapheme cluster
\z # at the true end of string
) # } end look ahead
)
That assumes you don’t want it to match “interstitially” (that is, before any line-terminator grapheme), which would be the simpler:
(?=\R)
Some argument can be made for that \R? being made into a \R* instead, in case you should happen to have multiple line-terminators at the end of a record, like several newlines in a row. That way 0, 1, 2, or however many EOL graphemes are allowed before the end of the string.
On the other hand, it may well be the case that a paragraph must be at least two EOL graphemes, not just one alone. For example, this is true in markup here and in other files with “blank-line separated” types of paragraphs. So no EOLs are ok, and two or more are too, but not just one of them.
For such text, you would need \R{2,}, but the whole bit would be optionalized, yielding in that case:
(?x: # enable comments
\. # a literal dot character
(?! # look ahead for NOT the following {
(?:
\R {2,} # two or more EOL grapheme clusters
) ? # optionally
\z # at the true end of string
) # } end negated look ahead
)
If you don’t have \R from UTS 18: Unicode Regular Expressions — Line Boundaries in your regex flavor, then you will have to write it out the hard way, which is the rather annoying:
(?x: # We are emulating \R per UTS#18
(?> # Prohibit backtrack within subpattern
\r \n # Match a CRLF without backtracking
# or else any code point with the
# vertical space character property
# \p{VertSpace}, here enumerated in full
| [\x0A-\x0D\x85\x{2028}\x{2029}]
)
)
You need the no-backtracking bit to avoid something like \R{2} being allowed to match a single CRLF, and it isn’t allowed to do that.
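Python's re module has no \R, and atomic groups are not available in every Python version, so the sketch below swaps in a different technique: a lone \r is only accepted when a lookahead confirms it is not the first half of a CRLF. That gives the same protection against \R{2} matching a single CRLF. The names EOL and PERIOD_NOT_AT_PARAGRAPH_END are invented for the example, and \Z stands in for \z.

```python
import re

# One line terminator. A lone \r only counts when it is not followed by \n,
# so a CRLF pair can never be split into two terminators.
EOL = r"(?:\r\n|\r(?!\n)|[\n\x0b\x0c\x85\u2028\u2029])"

# A period NOT followed by (optionally two or more EOLs and then) the true
# end of the string, mirroring the blank-line-separated paragraph variant.
PERIOD_NOT_AT_PARAGRAPH_END = re.compile(rf"\.(?!(?:{EOL}{{2,}})?\Z)")


def has_inner_period(text):
    # Hypothetical helper: true when some period is not at a paragraph end.
    return PERIOD_NOT_AT_PARAGRAPH_END.search(text) is not None
```

With this definition a single trailing CRLF does not count as a paragraph end, while two of them do.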
One final thing to consider is whether you want to allow for optional horizontal whitespace to intervene between the period and the EOL. I rather imagine that you do, but without a tighter formal specification in the OP, it’s impossible to say so definitely.
You should use a negative lookahead.
\.(?!$|\n)
More on this: http://www.regular-expressions.info/lookaround.html
I am working in VB.Net and trying to use Regex.Replace to format a string I am using to query SQL. What I'm going for is to cut out comments ("--"). I've found that in most cases the below works for what I need.
string = Regex.Replace(command, "--.*\n", "")
and
string = Regex.Replace(command, "--.*$", "")
However, I have run into a problem. If my query contains a string that itself includes the double dash, it doesn't work: the replace just cuts out the rest of the line starting at the double dash. It makes sense to me why, but I can't figure out the regular expression I need to match on.
Logically, I need to match on a string that starts with "--" and is not preceded by "'" and not followed by "'", with any number of characters in between. But I'm not sure how to express that in a regular expression. I have tried variations of:
string = Regex.Replace(cmd, "[^('.*)]--.*\n[^(.*')]", "")
Which I know is obviously wrong. I have looked at a couple of online resources including http://www.codeproject.com/KB/dotnet/regextutorial.aspx
but due to my lack of understanding I can't seem to figure this one out.
I think you meant "match on a string that starts with -- and is not preceded by ' and not followed by ' with any number of characters in between".
If so, then this is what you are looking for:
string = Regex.Replace(cmd, "(?<!'.*?--)--(?!.*?').*(?=\r\n)", "")
'EDIT: modified a little
Of course, it means you can't have apostrophes in your comments... and would be exceedingly easy to hack if someone wanted to (you aren't thinking of using this to protect against injection attacks, are you? ARE YOU!??! :D )
I can break down the expression if you'd like, but it's essentially the same as my modified quote above!
EDIT:
I modified the expression a little, so it does not consume any carriage return, only the comment itself... the expression says:
(?<! # negative lookbehind assertion*
' # match a literal single quote
.*? # followed by anything (reluctantly*)
-- # two literal dashes
) # end assertion
-- # match two literal dashes
(?! # negative lookahead assertion
.*? # match anything (reluctant)
' # followed by a literal single quote
) # end assertion
.* # match anything
(?= # positive lookahead assertion
\r\n # match carriage-return, line-feed
) # end assertion
negative lookbehind assertion means at this point in the match, look backward here and assert that this cannot be matched
negative lookahead assertion means look forward from this point and assert this cannot be matched
positive lookahead asserts the following expression CAN be matched
reluctant means only consume a match for the previous atom (the . which means everything in this case) if you cannot match the expression that follows. Thus the .*? in .*?-- (when applied against the string abc--) will consume a, then check to see if the -- can be matched and fail; it will then consume ab, but stop again to see if the -- can be matched and fail; once it consumes abc and the -- can be matched (success), it will finally consume the entire abc--
non-reluctant or "greedy" which would be .* without the ? will match abc-- with the .*, then try to match the end of the string with -- and fail; it will then backtrack until it can match the --
one additional note is that the . "anything" does not by default include newlines (carriage-return/line-feed), which is needed for this to work properly (there is a switch that will allow . to match newlines and it will break this expression)
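The difference between the reluctant and greedy quantifiers is easy to see in Python with a string that contains two double dashes. The sample string is made up for the example.

```python
import re

text = "abc--x--"

# Reluctant: stops at the first place where "--" can be matched.
reluctant = re.match(r".*?--", text).group()

# Greedy: grabs everything, then backtracks to the last "--".
greedy = re.match(r".*--", text).group()

print(reluctant)  # abc--
print(greedy)     # abc--x--
```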
A good resource - where I've learned 90% of what I know about regex - is Regular-Expressions.info
Tread carefully and good luck!
OK what you are doing here is not right :
/[^('.*)]--.*\n[^(.*')]/
You are saying the following :
Match one character that is not (, ), ', . or *, then match --, then match anything up to a newline, and then match one more character that is not in that same set.
What you probably meant to do is this :
/(?<!['"])\s*--.*[\r\n]*/
Which says: make sure the position is not preceded by a ' or ", then match any whitespace, then --, then anything else up to the end of the line, plus any trailing carriage-return or line-feed characters.
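For comparison, here is a sketch of a different technique that avoids lookbehind altogether (Python's re does not accept a variable-length lookbehind such as the one in the .NET answer). It matches either a quoted string or a comment, and keeps only the quoted strings. It assumes that SQL strings use single quotes with '' as the embedded-quote escape, and the function name is invented for the example.

```python
import re

# Group 1: a single-quoted SQL string ('' is an embedded quote).
# Otherwise: a "--" comment running to the end of the line.
STRING_OR_COMMENT = re.compile(r"('(?:[^']|'')*')|--[^\r\n]*")


def strip_sql_comments(sql):
    # Quoted strings are put back unchanged; comments become empty.
    return STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", sql)


query = "SELECT '--x' FROM t -- trailing comment\nWHERE 1 = 1"
print(strip_sql_comments(query))
```

Because a quoted string is consumed as a whole before the comment alternative is tried, a double dash inside a string is never treated as a comment. As with the answers above, this is not a defence against SQL injection.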
I am having problems with the nested '+'/'-' lookahead/lookbehind in regex.
Let's say that I want to change the '*' in a string with '%' and let's say that '\' escapes the next character. (Turning a regex to sql like command ^^).
So the string
'*test*' should be changed to '%test%',
'\\*test\\*' -> '\\%test\\%', but
'\*test\*' and '\\\*test\\\*' should stay the same.
I tried:
(?<!\\)(?=\\\\)*\* but this doesn't work
(?<!\\)((?=\\\\)*\*) ...
(?<!\\(?=\\\\)*)\* ...
(?=(?<!\\)(?=\\\\)*)\* ...
What is the correct regex that will match the '*'s in examples given above?
What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\* or if these are essentially wrong the difference between regex that have such a visual construction?
To find an unescaped character, you would look for a character that is preceded by an even number of (or zero) escape characters. This is relatively straight-forward.
(?<=(?<!\\)(?:\\\\)*)\* # this is explained in Tim Pietzcker's answer
Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:
(?=(?<!\\)(?:\\\\)*\*)(\\*)\* # also look at ridgerunner's improved version
Replace this with the contents of group 1 and a % sign.
Explanation
(?= # start look-ahead
(?<!\\) # a position not preceded by a backslash (via look-behind)
(?:\\\\)* # an even number of backslashes (don't capture them)
\* # a star
) # end look-ahead. If found,
( # start group 1
\\* # match any number of backslashes in front of the star
) # end group 1
\* # match the star itself
The look-ahead makes sure only even numbers of backslashes are taken into account. Anyway, there is no way around matching them into a group, since the look-ahead does not advance the position in the string.
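The look-ahead version only needs a fixed-width look-behind, so it can be tried in Python as is. This is a minimal sketch run against the four examples from the question; the function name is invented.

```python
import re

# Look-ahead checks for an even run of backslashes followed by a star;
# group 1 then re-matches those backslashes so they can be put back.
UNESCAPED_STAR = re.compile(r"(?=(?<!\\)(?:\\\\)*\*)(\\*)\*")


def star_to_percent(text):
    # Replace with the contents of group 1 and a % sign.
    return UNESCAPED_STAR.sub(r"\1%", text)


for sample in ["*test*", r"\\*test\\*", r"\*test\*", r"\\\*test\\\*"]:
    print(sample, "->", star_to_percent(sample))
```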
Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:
Replace: ((?<!\\)(?:\\\\)*)\* with $1%
Here it is in the form of a commented PHP snippet:
// Replace all non-escaped asterisks with "%".
$re = '% # Match non-escaped asterisks.
( # $1: Any/all preceding escaped backslashes.
(?<!\\\\) # At a position not preceded by a backslash,
(?:\\\\\\\\)* # Match zero or more escaped backslashes.
) # End $1: Any preceding escaped backslashes.
\* # Unescaped literal asterisk.
%x';
$text = preg_replace($re, '$1%', $text);
Addendum: Non-lookaround JavaScript Solution
The above solution does require lookbehind, so it will not work in JavaScript. The following JavaScript solution does not use lookbehind:
text = text.replace(/(\\[\S\s])|\*/g,
function(m0, m1) {
return m1 ? m1 : '%';
});
This solution replaces each instance of backslash-anything with itself, and each instance of * asterisk with a % percent sign.
Edit 2011-10-24: Fixed Javascript version to correctly handle cases such as: **text**. (Thanks to Alan Moore for pointing out the error in previous version.)
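The same callback idea carries over to Python, where re.sub accepts a function as the replacement. This is a sketch, with an invented function name.

```python
import re

# Either a backslash plus any one character (group 1), or a bare asterisk.
ESCAPED_OR_STAR = re.compile(r"(\\[\S\s])|\*")


def star_to_percent(text):
    # Escaped pairs are returned unchanged; a bare asterisk becomes "%".
    return ESCAPED_OR_STAR.sub(lambda m: m.group(1) or "%", text)
```

Because every backslash is consumed together with the character it escapes, an asterisk can only match the second alternative when it is unescaped.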
Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution:
s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;
The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.
The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.
I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
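Although Python's re has no \G, the effect can be approximated by hand: a compiled pattern's match(text, pos) only succeeds if the match starts exactly at pos, so a loop that restarts at the end of the previous match plays the role of \G. This is a sketch of that workaround, not the Perl one-liner itself, and the names are invented.

```python
import re

# The unrolled loop from the Perl solution, followed by the unescaped star.
CHUNK = re.compile(r"([^*\\]*(?:\\.[^*\\]*)*)\*")


def star_to_percent(text):
    pieces = []
    pos = 0
    while True:
        # match() is anchored at pos, which emulates \G.
        m = CHUNK.match(text, pos)
        if m is None:
            break
        pieces.append(m.group(1) + "%")
        pos = m.end()
    # Whatever follows the last unescaped star is copied unchanged.
    pieces.append(text[pos:])
    return "".join(pieces)
```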
So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?
Search for
(?<=(?<!\\)(?:\\\\)*)\*
and replace with %.
Explanation:
(?<= # Assert that it's possible to match before the current position...
(?<!\\) # (unless there are more backslashes before that)
(?:\\\\)* # an even number of backslashes
) # End of lookbehind
\* # Then match an asterisk
The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:
Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).
Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).
The result is a short, simple pattern that doesn't need lookarounds or callbacks, and it's shorter than anything else I see so far.
/(?<!\\)(\\.)*\*/g
And the replacement string:
"$1%"
This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:
var regx = /(^|[^\\])(\\.)*\*/g;
while (input.match(regx)) {
input = input.replace(regx, '$1$2%');
}
There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.
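Here is a Python sketch of the "eat the whole chain of escaped characters" idea. One adjustment is made for this sketch: the repetition is wrapped in a single capture group, ((?:\\.)*), because a repeated group such as (\\.)* only remembers its last iteration, which would drop the earlier pairs of a longer chain from the replacement. The function name is invented.

```python
import re

# Not preceded by a backslash, then a chain of escaped characters
# (captured as a whole), then an unescaped asterisk.
CHAIN_THEN_STAR = re.compile(r"(?<!\\)((?:\\.)*)\*")


def star_to_percent(text):
    # Put the chain back and swap the asterisk for a percent sign.
    return CHAIN_THEN_STAR.sub(r"\1%", text)
```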
The following regex finds text between substrings FTW and ODP.
/FTW(((?!FTW|ODP).)+)ODP+/
What does the (?!...) do?
(?!regex) is a zero-width negative lookahead. It will test the characters at the current cursor position and forward, testing that they do NOT match the supplied regex, and then return the cursor back to where it started.
The whole regexp:
/
FTW # Match Characters 'FTW'
( # Start Match Group 1
( # Start Match Group 2
(?!FTW|ODP) # Ensure next characters are NOT 'FTW' or 'ODP', without matching
. # Match one character
)+ # End Match Group 2, Match One or More times
) # End Match Group 1
OD # Match characters 'OD'
P+ # Match 'P' One or More times
/
So - Hunt for FTW, then capture while looking for ODP+ to end our string. Also ensure that the data between FTW and ODP+ doesn't contain FTW or ODP
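The pattern can be checked in Python against a string with several FTW and ODP runs. The sample string is the one used in another answer in this thread.

```python
import re

PATTERN = re.compile(r"FTW(((?!FTW|ODP).)+)ODP+")

text = "FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj"
m = PATTERN.search(text)

print(m.group(0))  # FTWjjODPPPP
print(m.group(1))  # jj
```

The first three FTW occurrences are rejected because each is followed either immediately by FTW or ODP, or by another FTW before any ODP is reached.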
From perldoc:
A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/ . Sometimes it's still easier just to say:
if (/bar/ && $` !~ /foo$/)
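The pitfall described in the perldoc excerpt is easy to reproduce in Python. This is a sketch with invented variable names.

```python
import re

# The lookahead is tested at the position where "bar" starts, so it always
# passes there: "foobar" matches even though "bar" is preceded by "foo".
naive = re.compile(r"(?!foo)bar")

# The workaround from the excerpt: three characters that are not "foo",
# or fewer than three characters after the start of the string.
workaround = re.compile(r"(?:(?!foo)...|^.{0,2})bar")

print(bool(naive.search("foobar")))       # True
print(bool(workaround.search("foobar")))  # False
print(bool(workaround.search("bazbar")))  # True
print(bool(workaround.search("bar")))     # True
```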
It means "not followed by...". Technically this is what's called a negative lookahead, in that you can peek at what's ahead in the string without consuming it. It is a class of zero-width assertion, meaning that such expressions don't consume any part of the input.
The programmer must have been typing too fast. Some characters in the pattern got flipped. Corrected:
/WTF(((?!WTF|ODP).)+)ODP+/
Regex
/FTW(((?!FTW|ODP).)+)ODP+/
matches first FTW immediately followed neither by FTW nor by ODP, then all following chars up to the first ODP (but if there is FTW somewhere in them there will be no match) then all the letters P that follow.
So in the string:
FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj
it will match the part shown in brackets:
FTWFTWODPFTWjj[FTWjjODPPPP]jjODPPPjjj
'?!' is actually part of '(?! ... )', it means whatever is inside must NOT match at that location.