Regex match any character NOT followed by "? something"

How can I match a path only if there is no "?" followed by zero or more characters at the end?
I have the following path:
/something/contentimg/coast03.jpg?itok=ABC
I want the filename, but only if there is no "?something" after the file extension.
I tried:
/^.*\/(.*)(?!\?.*)$/
But it matches anyway. This is the result. What am I doing wrong?
Array
(
[0] => /something/contentimg/coast03.jpg?itok=ABC
[1] => coast03.jpg?itok=ABC
)
Using php.

Use parse_url:
print_r(parse_url('/something/contentimg/coast03.jpg?itok=ABC'));
Array
(
[path] => /something/contentimg/coast03.jpg
[query] => itok=ABC
)
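For comparison, here is a rough Python counterpart of the parse_url approach. It is only a sketch: urlsplit and posixpath.basename are standard-library calls, while the helper name filename_if_no_query is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def filename_if_no_query(url):
    """Return the last path segment, or None when the URL has a non-empty query string."""
    parts = urlsplit(url)
    if parts.query:
        # Something follows the "?", so the caller does not want this one.
        return None
    return posixpath.basename(parts.path)


print(filename_if_no_query("/something/contentimg/coast03.jpg"))
print(filename_if_no_query("/something/contentimg/coast03.jpg?itok=ABC"))
```

Splitting the URL first keeps the "is there a query?" decision out of the regex entirely.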

The * quantifier behaves greedily and matches everything up to the end of the input string, so the negative lookahead kicks in at the end of the input (and of course doesn't find what it's looking for). The regex should be done a little differently:
/^.*\/([^?]+)$/
This expression matches one or more non-question-mark characters and then asserts that it has reached the end of the input string, which is what you want to do.
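The same pattern can be tried outside PHP. Below is a minimal Python sketch, assuming the path arrives as a plain string; the function name extract_filename is hypothetical.

```python
import re

# Same idea as /^.*\/([^?]+)$/ : after the last "/", allow only non-"?" characters.
FILENAME = re.compile(r"^.*/([^?]+)$")


def extract_filename(path):
    """Return the filename, or None if a '?' appears after the last slash."""
    match = FILENAME.match(path)
    return match.group(1) if match else None


print(extract_filename("/something/contentimg/coast03.jpg"))
print(extract_filename("/something/contentimg/coast03.jpg?itok=ABC"))
```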

^.*\/([^?]+)(?![?].+)$
Working DEMO
Your expression does not work because (.*) matches everything after the last /, so there is nothing left for the negative lookahead to inspect.

This is how it's currently matching:
.* - greedily matches up to before the last / - /something/contentimg
\/ - matches /
(.*) - matches the rest of the string - coast03.jpg?itok=ABC
(?!\?.*) - checks that the characters following don't match \?.*; since we are already at the end of the string, nothing follows, so the lookahead always succeeds.
What you should do:
It seems like you can just check if a ? exists in the string, so try:
/^(?!.*\?)/
Or match up to the last /, then check for a ? from there:
/^(?!.*\/.*\?)/
Explanation:
You already know (?!...) is negative look-ahead, you're just not entirely sure how to use it. Wherever you put it, it tries its best to match the given pattern from that position onwards. If it succeeds, the regex doesn't match. So it might be a good idea to put this at the very beginning and try to match the rest of the string.
So the basic format for this example is:
/^(?!...).*$/
where (?!...) contains a pattern for the strings you want to exclude.
The .*$ at the end isn't required if you only need to test whether the string matches. If you want the look-ahead to check the entire string, remember the $ at the end of the look-ahead:
/^(?!...$)/
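To see the "look-ahead first" layout in action, here is a small Python sketch of the first pattern suggested above. The function name has_no_query is an assumption for illustration.

```python
import re

# Negative look-ahead anchored at the start: reject if a "?" occurs anywhere.
NO_QUESTION_MARK = re.compile(r"^(?!.*\?)")


def has_no_query(path):
    """True when the string contains no '?' at all."""
    return NO_QUESTION_MARK.match(path) is not None


print(has_no_query("/something/contentimg/coast03.jpg"))
print(has_no_query("/something/contentimg/coast03.jpg?itok=ABC"))
```

Because the look-ahead is zero-width, the pattern consumes nothing; it only answers yes or no.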

Related

Regex in middle of text doesn't match

I have a regex to find url's in text:
^(?!:\/\/)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}?$
However it fails when it is surrounded by text:
https://regex101.com/r/0vZy6h/1
I can't seem to grasp why it's not working.
Possible reasons why the pattern does not work:
^ and $ make it match the entire string
(?!:\/\/) is a negative lookahead that fails the match if, immediately to the right of the current location, there is :// substring. But [a-zA-Z0-9-_]+ means there can't be any ://, so, you most probably wanted to fail the match if :// is present to the left of the current location, i.e. you want a negative lookbehind, (?<!:\/\/).
[a-zA-Z]{2,11}? - matches 2 chars only if $ is removed since the {2,11}? is a lazy quantifier and when such a pattern is at the end of the pattern it will always match the minimum char amount, here, 2.
Use
(?<!:\/\/)([a-zA-Z0-9-_]+\.)*[a-zA-Z0-9][a-zA-Z0-9-_]+\.[a-zA-Z]{2,11}
See the regex demo. Add \b word boundaries if you need to match the substrings as whole words.
Note in Python regex there is no need to escape /, you may replace (?<!:\/\/) with (?<!://).
The spaces are not being matched. Try adding space to the character sets checking for leading or trailing text.
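A quick way to check the look-behind version against surrounding text is sketched below in Python. Two things here are assumptions rather than part of the answers: the sample sentence, and the reordering of the character class to [a-zA-Z0-9_-] so the hyphen is unambiguously literal. The \b boundaries mentioned above are added so a match cannot start in the middle of a word.

```python
import re

# Unanchored pattern with a negative look-behind for "://" and word boundaries.
DOMAIN = re.compile(
    r"\b(?<!://)(?:[a-zA-Z0-9_-]+\.)*[a-zA-Z0-9][a-zA-Z0-9_-]+\.[a-zA-Z]{2,11}\b"
)


def find_domains(text):
    """Return bare domains that are not immediately preceded by '://'."""
    return [m.group(0) for m in DOMAIN.finditer(text)]


print(find_domains("see example.com and https://foo.org here"))
```

Without the leading \b, the engine would simply skip one character past the "://" and match the rest of the host name.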

Using RegEx to match the beginning of string if end of string is not

I am trying to match lines in a configuration that start with the word "deny" but do not end with the word "log". This seems terribly elementary but I can not find my solution in any of the numerous forums I have looked. My beginners mindset led me to try "^deny.* (?!log$)" Why wouldn't this work? My understanding is that it would find any strings that begin with "deny" followed by any character for 0 or more digits where the end of line is something other than log.
When given a line like deny this log, your ^deny.*(?!log$) regex (I'm omitting the space that was in your sample question) is evaluated as follows:
^deny matches "deny".
.* means "match 0 or more of any character", so it can match " this log".
(?!log$) means "make sure that the next characters aren't 'log' then the end of the line." In this case, they're not - they're just the end of the line - so the regex matches.
Try this regex instead:
^deny.*$(?<!log)
"Match deny at the beginning of the string, then match to the end of the line, then use a zero-width negative look-behind assertion to check that whatever we just matched at the end of the line is not 'log'."
With all of that said...
Regexes aren't necessarily the best tool for the job. In this case, a simple Boolean operator like
if (/^deny/ and not /log$/)
is probably clearer than a more advanced regex like
if (/^deny.*$(?<!log)/)
(?!log$) is a zero-width negative look-ahead assertion that means don't match if immediately ahead at this point in the string is log and the end of the string, but the .* in your regex has already greedily consumed all the characters right up to the end of the string so there is no way the log could then match.
If your regular expression implementation supports look-behinds, you could use a regex such as the one in Josh Kelley's answer. If you were using JavaScript, you could use
/^deny(?:.{0,2}|.*(?!log)...)$/m
The m flag means multiline mode, which makes ^ and $ match the start and end of every line rather than just the start and end of the string.
Note that three . are positioned after the negative look-ahead so that it has space to match log if it is there. Including these three dots meant it was also necessary to add the .{0,2} option so that strings with from zero to two characters after deny would also match. The (?:a|b) means a non-capturing group where a or b has to match.
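The three approaches from this thread can be compared side by side. The following is a Python sketch, not taken from either answer: the function names and sample lines are made up, and the look-behind and look-ahead patterns are the ones quoted above.

```python
import re

LOOKBEHIND = re.compile(r"^deny.*$(?<!log)")
LOOKAHEAD_ONLY = re.compile(r"^deny(?:.{0,2}|.*(?!log)...)$")


def wanted_plain(line):
    """Plain string tests: starts with 'deny' and does not end with 'log'."""
    return line.startswith("deny") and not line.endswith("log")


def wanted_lookbehind(line):
    return LOOKBEHIND.search(line) is not None


def wanted_lookahead(line):
    return LOOKAHEAD_ONLY.search(line) is not None


for sample in ["deny this log", "deny this", "deny", "denylog", "permit this"]:
    print(sample, wanted_plain(sample), wanted_lookbehind(sample), wanted_lookahead(sample))
```

All three agree on these samples; the plain string test is arguably the easiest to read.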

Regular expression with possible hyphen and then a limited number of words characters

I need a regex to match expressions which contain the string OKAY then a possible hyphen, and then zero or one word characters. after this any non-word-character is accepted and then anything. for expressions which match, OKAY will be changed to OK if there is no word-character following, and to e.g: OA if the letter following is A. if the hyphen exists it is dropped.
OKAY => OK
OKAY- => OK
OKAYA => OA
OKAY-A => OA
OKAYAB => OKAYAB (no-match)
OKAY-AB => OKAY-AB (no-match)
examples may be followed by e.g: .CD without changing the results
OKAY.CD => OK.CD
OKAY-.CD => OK.CD
OKAYA.CD => OA.CD
OKAY-A.CD => OA.CD
OKAYAB.CD => OKAYAB.CD (no-match)
OKAY-AB.CD => OKAY-AB.CD (no-match)
my problem implementing this was that since both the hyphen and the word-character are optional, I get "lazy" matches which match also the non-wanted cases.
for the sake of education I would appreciate examples both with and without look-aheads (if possible).
Here is a regex that should work for you:
\bOKAY(?>-?)(\w)?([^\w\s]\S*)?(?!\S)
Since it isn't clear what language you are using, here is pseudo code for how you would do the replacement.
"O" + (match.group(1) if match.group(1) else "K") + match.group(2)
Here is a rubular: http://www.rubular.com/r/SE8MBkUUUo
edit: I made some changes in the above regex after the comments, but the description below does not reflect those changes. Here are the changes from the original regex:
Changed ^ to \b so it doesn't need to start at beginning of line
\W became [^\w\s], this prevents OKAY OKAY from being one match
Changed .* to \S* so the match will end at whitespace
Changed $ to (?!\S), (?!\S) means "only match if we are at the end of the string or the next character is whitespace", could also be written as (?=\s|\z)
The really tricky part here is that a regex like ^OKAY-?(\w)?(\W.*)?$ looks like it would work, but it does not for a case like OKAY-AB because in the end both the -? and the (\w)? will not match, and then (\W.*)? will match the remainder of the string.
What we need to do to fix this is make it so -? will not backtrack. This would be simple if possessive quantifiers were supported by .NET, then we could just change it to -?+.
Unfortunately they aren't supported, so we need to use atomic grouping instead. (?>-?) will optionally match a -, but will forget all backtracking information as soon as it exits the group. Note that the atomic group does not capture, so (\w)? is capture group 1.
Don't know .NET regex, but this is a start with preg-style matching:
OKAY-?(\w?)([^\w-]\w+)?\s*$
If $1 is empty, then output is OK$2
Otherwise, output is O$1$2.
To do this without lookaheads, you can use
^(OKAY)(((-\w?|\w)(\W.*)?)|[^-\w].*)?$
This matches the word "OKAY" and then an optional group containing either a -, an optional word character, and then an optional non-word-character followed by anything group, or a character that is not a - or a word character followed by anything. The ^ and $ match the start and end of the string respectively, so it will only match exactly the acceptable strings.
Lookaheads would barely make a difference. The only change would be to put a lookahead ((?=...)) around everything after the "OKAY" group.
To use this with .net, the only change needed would be to escape all of the \ in the string.
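As a check on the look-ahead-free pattern above, here is a Python sketch that applies it to the examples from the question. The function name shorten and the way the groups are recombined are assumptions for illustration; the regex itself is copied unchanged.

```python
import re

# Groups: 1 = OKAY, 2 = everything after it, 3 = first alternative,
# 4 = optional hyphen and/or single word character, 5 = non-word char plus the rest.
OKAY = re.compile(r"^(OKAY)(((-\w?|\w)(\W.*)?)|[^-\w].*)?$")


def shorten(text):
    """Rewrite OKAY[-][X][.rest] to OK or OX plus rest; leave non-matches alone."""
    m = OKAY.match(text)
    if m is None:
        return text
    if m.group(3) is not None:
        letter = m.group(4).lstrip("-")   # drop the hyphen, keep the letter if any
        rest = m.group(5) or ""
    else:
        letter = ""
        rest = m.group(2) or ""           # second alternative, or nothing after OKAY
    return "O" + (letter or "K") + rest


for sample in ["OKAY", "OKAY-", "OKAYA", "OKAY-A", "OKAYAB", "OKAY-AB", "OKAY-A.CD"]:
    print(sample, "=>", shorten(sample))
```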

Regex.Replace formatting a query

I am working in VB.Net and trying to use Regex.Replace to format a string I am using to query SQL. What I'm going for is to cut out "--" comments. I've found that in most cases the below works for what I need.
string = Regex.Replace(command, "--.*\n", "")
and
string = Regex.Replace(command, "--.*$", "")
However, I have run into a problem. If I have a string inside of my query that contains the double dash, it doesn't work: the replace just cuts out the rest of the line starting at the double dash. It makes sense to me why, but I can't figure out the regular expression I need to match on.
Logically I need to match on a string that starts with "--" and is not proceeded by "'" and not followed by "'" with any number of characters in between. But I'm not sure how to express that in a regular expression. I have tried variations of:
string = Regex.Replace(cmd, "[^('.*)]--.*\n[^(.*')]", "")
Which I know is obviously wrong. I have looked at a couple of online resources including http://www.codeproject.com/KB/dotnet/regextutorial.aspx
but due to my lack of understanding I can't seem to figure this one out.
I think you meant "match on a string that starts with -- and is not preceded (rather than "proceeded") by ' and not followed by ' with any number of characters in between"
If so, then this is what you are looking for:
string = Regex.Replace(cmd, "(?<!'.*?--)--(?!.*?').*(?=\r\n)", "")
'EDIT: modified a little
Of course, it means you can't have apostrophes in your comments... and would be exceedingly easy to hack if someone wanted to (you aren't thinking of using this to protect against injection attacks, are you? ARE YOU!??! :D )
I can break down the expression if you'd like, but it's essentially the same as my modified quote above!
EDIT:
I modified the expression a little, so it does not consume any carriage return, only the comment itself... the expression says:
(?<! # negative lookbehind assertion*
' # match a literal single quote
.*? # followed by anything (reluctantly*)
-- # two literal dashes
) # end assertion
-- # match two literal dashes
(?! # negative lookahead assertion
.*? # match anything (reluctant)
' # followed by a literal single quote
) # end assertion
.* # match anything
(?= # positive lookahead assertion
\r\n # match carriage-return, line-feed
) # end assertion
negative lookbehind assertion means at this point in the match, look backward here and assert that this cannot be matched
negative lookahead assertion means look forward from this point and assert this cannot be matched
positive lookahead asserts the following expression CAN be matched
reluctant means only consume a match for the previous atom (the . which means everything in this case) if you cannot match the expression that follows. Thus the .*? in .*?-- (when applied against the string abc--) will consume a, then check to see if the -- can be matched and fail; it will then consume ab, but stop again to see if the -- can be matched and fail; once it consumes abc and the -- can be matched (success), it will finally consume the entire abc--
non-reluctant or "greedy" which would be .* without the ? will match abc-- with the .*, then try to match the end of the string with -- and fail; it will then backtrack until it can match the --
one additional note is that the . "anything" does not by default include newlines (carriage-return/line-feed), which is needed for this to work properly (there is a switch that will allow . to match newlines and it will break this expression)
A good resource - where I've learned 90% of what I know about regex - is Regular-Expressions.info
Tread carefully and good luck!
OK what you are doing here is not right :
/[^('.*)]--.*\n[^(.*')]/
You are saying the following :
Match one character that is not (, ), ', . or *, then match --, then match anything up to a newline, and then match one more character that is not in the same set as the one at the start.
What you probably meant to do is this :
/(?<!['"])\s*--.*[\r\n]*/
Which says: make sure the preceding character is not a ' or ", then match any whitespace, then -- and anything else up to the end of the line, including any trailing carriage return or line feed characters.
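A different technique from both answers above, sketched here in Python, is to let the regex consume quoted strings first and only then look for comments, so a "--" inside quotes is never seen as a comment start. The assumptions are loud ones: strings are single-quoted with '' as the escape, block comments are ignored, and the function name strip_sql_comments is made up. Like the answers, this is not a defence against SQL injection.

```python
import re

# Alternative 1 (captured): a single-quoted SQL string, '' being an escaped quote.
# Alternative 2: a "--" comment running to the end of the line.
TOKEN = re.compile(r"('(?:[^']|'')*')|--[^\r\n]*")


def strip_sql_comments(sql):
    """Remove '--' comments but keep quoted strings untouched."""
    # Keep the text when the string alternative matched, drop it for comments.
    return TOKEN.sub(lambda m: m.group(1) or "", sql)


print(strip_sql_comments("SELECT '--x' AS a -- comment\nFROM t"))
```

The line break after a comment is left in place, which mirrors the intent of the edited expression above that avoids consuming the carriage return.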

Regex: what does (?! ...) mean?

The following regex finds text between substrings FTW and ODP.
/FTW(((?!FTW|ODP).)+)ODP+/
What does the (?!...) do?
(?!regex) is a zero-width negative lookahead. It will test the characters at the current cursor position and forward, testing that they do NOT match the supplied regex, and then return the cursor back to where it started.
The whole regexp:
/
FTW # Match Characters 'FTW'
( # Start Match Group 1
( # Start Match Group 2
(?!FTW|ODP) # Ensure next characters are NOT 'FTW' or 'ODP', without matching
. # Match one character
)+ # End Match Group 2, Match One or More times
) # End Match Group 1
OD # Match characters 'OD'
P+ # Match 'P' One or More times
/
So - Hunt for FTW, then capture while looking for ODP+ to end our string. Also ensure that the data between FTW and ODP+ doesn't contain FTW or ODP
From perldoc:
A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/ . Sometimes it's still easier just to say:
if (/bar/ && $` !~ /foo$/)
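The difference described in the perldoc excerpt is easy to reproduce. Here is a short Python sketch; Python's re module supports fixed-width look-behind, so (?<!foo) is shown alongside for contrast, and the sample strings are made up.

```python
import re

# Look-ahead in front of "bar" only inspects "bar" itself, so it never fails here.
print(bool(re.search(r"(?!foo)bar", "foobar")))

# A look-behind actually inspects the three characters before "bar".
print(bool(re.search(r"(?<!foo)bar", "foobar")))
print(bool(re.search(r"(?<!foo)bar", "bazbar")))

# The look-ahead-only workaround quoted above.
WORKAROUND = re.compile(r"(?:(?!foo)...|^.{0,2})bar")
print(bool(WORKAROUND.search("foobar")))
print(bool(WORKAROUND.search("xbar")))
```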
It means "not followed by...". Technically this is what's called a negative lookahead in that you can peek at what's ahead in the string without capturing it. It is a class of zero-width assertion, meaning that such expressions don't consume any part of the input.
The programmer must have been typing too fast. Some characters in the pattern got flipped. Corrected:
/WTF(((?!WTF|ODP).)+)ODP+/
Regex
/FTW(((?!FTW|ODP).)+)ODP+/
matches first FTW immediately followed neither by FTW nor by ODP, then all following chars up to the first ODP (but if there is FTW somewhere in them there will be no match) then all the letters P that follow.
So in the string:
FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj
it will match FTWjjODPPPP, that is, the fourth FTW, the jj after it, and the first ODP that follows together with all of its trailing Ps.
'?!' is actually part of '(?! ... )', it means whatever is inside must NOT match at that location.
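The sample string from this thread can be run through the pattern directly. Below is a minimal Python sketch; only the variable names are invented.

```python
import re

PATTERN = re.compile(r"FTW(((?!FTW|ODP).)+)ODP+")
SAMPLE = "FTWFTWODPFTWjjFTWjjODPPPPjjODPPPjjj"

match = PATTERN.search(SAMPLE)
print(match.group(0))   # the whole match
print(match.group(1))   # the text between FTW and ODP
print(match.start())    # where the match begins in SAMPLE
```

The earlier FTW occurrences are skipped because each is followed by another FTW or by ODP before a valid ODP ending can be reached.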