Why is this negative lookahead not working? - regex

I have this regex that is supposed to help me find and replace deprecated mysql queries. For some reason, though, once I replace one query, it recaptures the same area and merely extends it to the end of the next deprecated query. I'm trying to solve this by not letting it select a specific keyword that is in the replacement string (stmt), but it's ignoring this constraint for some reason.
(?mis)\$(?<sql>[a-z0-9]*) = (?<query>"select.*?\;)(?:(?!stmt).*?)\$(?<res>[a-z0-9]*?) = mysql_query.*?\;(?<txt>.*?)while\s?\(\$(?<row>[a-z0-9]*?) = mysql.*?\{\ [1]
Here is the Regex101 I'm using to debug.
(?:(?!stmt).*?) is the lookahead in question. I want it to allow for an arbitrary amount of text in between the named capture groups before and after. [2] The *? should already be forcing it to find the smallest section. As you can see below, there is a perfectly acceptable match starting on line 14 ($sql = "SELECT admin from user where id=" . $userID;), but it is insisting on starting all the way at the top with the old, and already replaced match.
Why is my negative lookahead not working the way I think it should be working? [3]
[1] I'm using (?mis) because PHPStorm doesn't play nice with normal flags.
[2] To prevent random code and bad formatting from getting in the way of the pattern.
[3] If this is an XY problem and I should be forcing the correct match a different way, I'll welcome that as an answer instead.

The negative lookahead isn't rejecting anything because its condition is never met. It denies a match only when the semicolon at the end of the <query> is immediately followed by "stmt", which isn't the case in your code: it's followed by newline, whitespace, dollar sign, then "stmt".
You can fix that part by extending the negative lookahead to (?!\s*\$stmt), but then the second problem becomes evident: that just extends the <query> match to the next semicolon, which isn't followed by a $stmt. You fix that by tightening the match in <query> to match greedily on non-semicolons, rather than non-greedily on anything. That is, (?<query>"select.*?\;) becomes (?<query>"select[^;]*\;). This creates a dead stop to the match at the first semicolon.
This will fail to match if you have any semicolons inside your SQL, but hey.
Does that get the desired result?
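To make the two steps above concrete, here is a minimal Python sketch. The PHP snippet in `code` is a made-up stand-in for the asker's file, and only the <query> part of the pattern is reproduced, so treat it as an illustration of the lookahead behaviour rather than the full find-and-replace pattern.

```python
import re

# Hypothetical sample: the first query was already converted to $stmt,
# the second one still uses the deprecated mysql_query call.
code = (
    '$sql = "SELECT name FROM user";\n'
    '    $stmt = $db->prepare($sql);\n'
    '$sql = "SELECT admin FROM user";\n'
    '$res = mysql_query($sql);\n'
)
flags = re.I | re.S

# Original idea: (?!stmt) only checks the text right after the semicolon,
# which is a newline, so the already-converted query is still accepted.
original = re.search(r'"select.*?;(?!stmt)', code, flags).group()

# Widened lookahead, but lazy .*? simply stretches to the next semicolon.
lazy_fix = re.search(r'"select.*?;(?!\s*\$stmt)', code, flags).group()

# Widened lookahead plus [^;]*, which cannot run past the first semicolon.
full_fix = re.search(r'"select[^;]*;(?!\s*\$stmt)', code, flags).group()

print(repr(original))
print(repr(lazy_fix))
print(repr(full_fix))
```

Only the last pattern skips the converted query and lands on the deprecated one.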

Related

regex to match all subfolders of a URL, except a few special ones

OK, I'm writing a regex that I want to match on a certain URL path and all subfolders underneath it, but with a few excluded. For context, this is for use inside Verizon Edgecast, which is a CDN caching system. It supports regex, but unfortunately I don't know which 'flavor' of regex it supports, and the documentation isn't clear about that either. It seems to support all the core regex features, though, and that should be all I need. Unfortunately, reading the documentation requires an account, but you can get the general idea of Edgecast here: https://www.verizondigitalmedia.com/platform/edgecast-cdn/
so, here is some sample data:
help
help/good
help/better
help/great
help/bad
help/bad/worse
and here is the regex I am using right now:
(^help$|help\/[^bad].*)
link: https://regex101.com/r/CBWUDE/1
broken down:
( - start capture group
^ - start of string
help - 1st thing that should match
$ - end of string
| - or
help - another thing that should match
\/ - escaped / so i can match help/
[^bad] - match any single character that isn't b, a, or d
. - any character
* - any number of times
) - end capture group
I would like the first 4 to match but not the last 2: help/bad and help/bad/worse should not be matches, and help/anythingelse should be a match.
This regex is working for me, except that help/better is not a match. The reason it's not a match, I'm pretty sure, is that 'better' contains a character that appears inside 'bad'. If I change 'better' to 'getter', it becomes a match, because it no longer has a b in it.
So what I really want is for my 'bad' to match only the whole word bad, and not anything with b, a, or d in it. I tried using a word boundary to do this, but it isn't giving me the results I need. Perhaps I just have the syntax wrong; this is what I tried:
(^help$|help\/[^\bbad\b].*)
But it does not seem to work: the 'bad' URLs are no longer excluded, and help/better is still not matching. I think it's because / is not a word boundary. I'm positive my problem with the original regex is with this part:
[^bad] - match any single character that isn't b, a, or d
My question is: how can I turn [^bad] into something that matches anything that doesn't contain the full string 'bad'?
You're going to want to use a negative lookahead, (?!bad), instead of negating specific letters with [^bad].
I think (^help$|help\/(?!bad).*) is what you're looking for.
Edit: if you mean anything with the word bad at all, not just help/bad you can make it (?!.*bad.*) This would prevent you from matching help/matbadtom for example. Full regex: (^help$|help\/(?!.*bad.*).*)
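For comparison, here is a small Python sketch that runs the character-class version and both lookahead versions over the sample paths. Edgecast's regex flavor is unknown, so this only shows how the patterns behave in Python's `re` module; the `matching` helper and the extra `help/matbadtom` path are just for illustration.

```python
import re

paths = ["help", "help/good", "help/better", "help/great",
         "help/bad", "help/bad/worse"]

char_class = re.compile(r"(^help$|help/[^bad].*)")      # original attempt
lookahead = re.compile(r"(^help$|help/(?!bad).*)")      # "bad" right after the slash
anywhere = re.compile(r"(^help$|help/(?!.*bad.*).*)")   # "bad" anywhere after the slash

def matching(pattern, candidates):
    """Return the candidates in which the pattern finds a match."""
    return [p for p in candidates if pattern.search(p)]

print(matching(char_class, paths))  # help/better is wrongly dropped
print(matching(lookahead, paths))   # only the help/bad... paths are dropped
print(matching(lookahead, ["help/matbadtom"]))
print(matching(anywhere, ["help/matbadtom"]))
```

Note that (?!bad) rejects anything that merely starts with "bad" after the slash, so a path such as help/badge would be excluded too.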

Get all the characters until a new date/hour is found

I have to parse a lot of content with a regular expression.
The content might, for example, be:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
I have this regular expression that will of course return 2 matches, and the groups that I need - date, hour, name, multi-line message:
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):([^\d]+)
The problem is that if a number is written inside the message this will not be OK, because the regex will stop getting more characters.
For example in this case this will not work:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you 2 doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
How do I get all the characters until a new date/hour is found?
The problem is with your final capturing group, ([^\d]+), which stops at the first digit.
Instead you can use ((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
The outer parentheses, ( ... ), indicate a capturing group.
The next set of parentheses, (?: ... )+, indicates a non-capturing group that we want to match 1 to infinite times.
Inside we have a negative lookahead, (?!\d{2}-\d{2}-\d{4}). This says that the character we are about to match cannot be the start of a date.
What we actually consume, [\s\S], is any single character, including a newline.
The entire regex that works looks like this:
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
https://regex101.com/r/wH5xR2/2
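As a quick check, here is the same regex run in Python against the sample that contains a digit in the message. The `.strip()` calls are my addition to trim the spaces and newline the groups pick up around the name and message.

```python
import re

log = (
    "14-08-2015 14:18 : Example : Hello =) How are you?\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!"
)

pattern = re.compile(
    r"(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):"
    # Consume one character at a time, as long as no date starts here.
    r"((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)"
)

messages = [
    (date, hour, name.strip(), text.strip())
    for date, hour, name, text in pattern.findall(log)
]
for entry in messages:
    print(entry)
```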
Use a lookahead for dates and get everything up to that.
/^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)/sm
I've edited your regex in two ways:
Added ^ to the front, ensuring you only start from timestamps on their own line, which should filter out most issues with people posting timestamps
Replaced the last capturing group with ((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)
(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}) is a negative lookahead, with date
(?:(lookahead).)* Matches any number of characters, none of which begins a date anchored to the start of a line.
((?:(lookahead).)*) Just captures the group for you.
It's not that efficient, but it works. Note the s flag for dotall (dot matches newlines) and the m flag that lets ^ match at the start of a line. ^ is necessary in the lookahead so that you don't stop the match in case someone posts a timestamp inside a message, and at the start to make sure you only match dates from the start of a line.
DEMO: https://regex101.com/r/rX8eH0/3
DEMO with flags in regex: https://regex101.com/r/rX8eH0/4
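Here is a Python sketch of the anchored variant, using re.S and re.M in place of the /sm flags. The log text is a made-up example with a timestamp quoted in the middle of a message, to show that only a timestamp at the start of a line ends the capture.

```python
import re

# Hypothetical log: the first message mentions a date and time mid-line.
log = (
    "14-08-2015 14:18 : Example : Meet on 15-08-2015 10:00?\n"
    "Or later?\n"
    "14-08-2015 14:19: Example2 : Fine!"
)

pattern = re.compile(
    r"^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?"
    # Stop only where a timestamp begins at the start of a line.
    r"((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)",
    re.S | re.M,
)

messages = [
    (date, hour, name.strip(), text.strip())
    for date, hour, name, text in pattern.findall(log)
]
for entry in messages:
    print(entry)
```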

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches (I get only the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
See Demo
Just some clarifications on why your regex is doing what it is doing:
\$\$[^\W+]\w+$
An unescaped $ char means end of string, so your pattern only matches something at the very end of the string. That's why it's getting only the last match.
The character class [^\W+] doesn't really make sense. A class starting with [^ negates the chars inside it, \W is itself the negation of word characters, and + inside a class means literally the char +. So you are saying "match anything that is not a non-word character and not a + sign", which is just a word character. I guess that was not what you wanted.
To match the next word, just \w+ will do it. And the global modifier /g ensures that you will not stop on the first match.
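The same behaviour can be reproduced in Python, where `re.findall` already returns every match, so no /g flag is needed. The `values` dictionary below is a hypothetical set of runtime values, added only to show the replacement step the question describes.

```python
import re

query = ("select * from asdf where asdf.starttime = $$DATA_START_TIME "
         "and asdf.endtime = $$DATA_END_TIME")

# The trailing $ pins the match to the end of the string.
anchored = re.findall(r"\$\$[^\W+]\w+$", query)

# Without the anchor, every $$NAME token is found.
unanchored = re.findall(r"\$\$\w+", query)

# Hypothetical runtime values, keyed by parameter name without the $$.
values = {"DATA_START_TIME": "'2015-01-01'", "DATA_END_TIME": "'2015-01-02'"}
filled = re.sub(r"\$\$(\w+)", lambda m: values[m.group(1)], query)

print(anchored)
print(unanchored)
print(filled)
```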
Based on what you said you wanted to match, this should work. It won't match $$lower_case_strings, if that's what you wanted. If not, add the "i" flag as well.
\${2}[A-Z_]+/g
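A short Python check of the case-sensitivity point, with re.I standing in for the "i" flag. The `text` string and the `$$lower_case` name are invented for the example.

```python
import re

text = "where a = $$DATA_START_TIME and b = $$lower_case"

upper_only = re.findall(r"\${2}[A-Z_]+", text)        # upper-case names only
any_case = re.findall(r"\${2}[A-Z_]+", text, re.I)    # case-insensitive

print(upper_only)
print(any_case)
```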

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
But my expected output is:
"." "" 'ami.12.0','allo.12'
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
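The same "match the protected words first, then put them back" trick can be sketched in Python. Two things differ from the R call above and are my own choices: the dots are escaped so they match only a literal ".", and a small function returns group 1 (or an empty string) instead of the "\\1" replacement.

```python
import re

strings = ["0.12", "1245", "ami.12.0 00", "allo.12 1"]

# Alternative 1 captures a protected word; alternative 2 matches any other digits.
pattern = re.compile(r"(ami\.12\.0|allo\.12)|\d+")

# Put the protected word back if it was captured, otherwise delete the digits.
cleaned = [pattern.sub(lambda m: m.group(1) or "", s) for s in strings]

print(cleaned)
```

The trailing spaces survive because only the digits after them are removed.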
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". So for example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see if that 12 is the start of either of the two ignored strings. It is not, so it goes ahead and replaces it. It would be best to generalize this, but in your specific case, you can probably achieve this by instead doing a negative lookbehind for any prefixes of the words (that can be followed by digit sequences) that you want to skip. You also need a (?<![[:digit:]]) guard, otherwise the engine simply starts the match one digit later (at the 2 in 12), where none of the prefixes precede it. So, you would use something like this:
gsub('(?<![[:digit:]])(?<!ami\\.|ami\\.12\\.|allo\\.)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
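Here is a Python sketch of the lookbehind approach. Python's `re` only accepts fixed-width lookbehinds, so the alternation is split into separate assertions. The (?<!\d) guard in the second pattern is my addition: without it, a match can start in the middle of a protected number.

```python
import re

strings = ["0.12", "1245", "ami.12.0 00", "allo.12 1"]

# Prefix lookbehinds only: the match can still begin at the second digit of "12".
prefixes_only = re.compile(r"(?<!ami\.)(?<!ami\.12\.)(?<!allo\.)\d+")

# Extra guard: a match may not begin right after another digit.
guarded = re.compile(r"(?<!\d)(?<!ami\.)(?<!ami\.12\.)(?<!allo\.)\d+")

without_guard = [prefixes_only.sub("", s) for s in strings]
with_guard = [guarded.sub("", s) for s in strings]

print(without_guard)
print(with_guard)
```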

Regex to include one thing but exclude another

I've been having a lot of trouble finding how to write a regex to include certain URLs starting with a specified phrase while excluding another.
We want to include pages that start with:
/womens
/mens
/kids-clothing/boys
/kids-clothing/girls
/homeware
But we want to exclude anything that has /sXXXXXXX in the URL - where the X's are numbers.
I've written this so far to match the below URLs but it's behaving very oddly. Should I be using lookarounds or something?
\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/boys|homeware).*[^s[0-9]+].*
/homeware/bathroom/s2522424/4-tier-pastel-pop-drawers-approx-91cm-x25cm-x-28cm
/homeware/bathroom/towels-and-bathmats
/homeware/bathroom/towels-and-bathmats/s2506420/boutique-luxury-towels
/homeware/bathroom/towels-and-bathmats?page=3&size=36&cols=4&sort=&id=/homeware/bathroom/towels-and-bathmats&priceRange[min]=1&priceRange[max]=14
/homeware/bathroom?page=3&size=36&cols=4&sort=&id=/homeware/bathroom&priceRange[min]=1&priceRange[max]=35
/homeware/bedroom
/homeware/bedroom/bedding-sets
/homeware/bedroom/bedding-sets/s2471012/striped-reversible-printed-duvet-set
/homeware/bedroom/bedding-sets/s2472706/check-printed-reversible-duvet-set
/homeware/bedroom/bedding-sets/s2475332/union-jack-duvet-set
/kids-clothing/boys/shop-by-age/toddler-3mnths-5yrs/s2520246/boys-lollipop-slogan-t-shirt
/kids-clothing/boys/shop-by-age/toddler-3mnths-5yrs/s2520253/boys-2-pack-dinosaur-t-shirts
/kids-clothing/girls/great-value/sale?page=1&size=36&cols=4&sort=price.asc&id=/kids-clothing/girls/great-value/sale&priceRange[min]=0.5&priceRange[max]=7
/kids-clothing/girls/mini-shops/ballet-outfits
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths/s2484120/3-pack-frill-pants-pinks
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths/s2504431/3-pack-l-s-bodysuit
/mens/categories/tops?page=5&size=36&cols=4&sort=&id=/mens/categories/tops&priceRange[min]=2&priceRange[max]=22.5
/mens/categories/trousers-and-chinos
/mens/categories/trousers-and-chinos/s2438566/easy-essential-cuffed-jogging-bottoms
/mens/categories/trousers-and-chinos/s2438574/easy-essential-cuffed-jogging-bottoms
/mens/categories/trousers-and-chinos/s2458939/regatta-zip-off-lightweight-outdoor-trousers
You are on the right track. A negative lookahead will do it:
"^(?!.*\/s\d+)\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)\/.*"
The ^ anchors to the start of the string. The (?!.*\/s\d+) means that "/sXXXXXXX" can't appear anywhere in the string, and the rest of it matches your required starting tokens.
The reason [^s[0-9]+] didn't work is that a negated class like [^xyz] matches only one single character. In most flavors the pattern is read as the class [^s[0-9] (any character that isn't "s", "[" or a digit), repeated one or more times by the +, followed by a literal "]". So it would match something like "abc]", which has nothing to do with excluding "/sXXXXXXX".
The reason you need to put your negative lookahead at the start of the string is so nothing is matched at all. If you put it after the \/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)\/.*, you would still successfully match everything before the "/sXXXXXXX". i.e. for line 1 of your data, you would match "/homeware/bathroom/".
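A quick Python check of the lookahead-at-the-start pattern against a few of the sample URLs. The forward slashes and hyphens are left unescaped here, since Python does not need them escaped; the `urls` list is a subset of the data in the question.

```python
import re

pattern = re.compile(
    r"^(?!.*/s\d+)"   # reject the whole string if /s<digits> appears anywhere
    r"/(womens|mens|kids-clothing/boys|kids-clothing/girls|homeware)/.*"
)

urls = [
    "/homeware/bathroom/s2522424/4-tier-pastel-pop-drawers-approx-91cm-x25cm-x-28cm",
    "/homeware/bathroom/towels-and-bathmats",
    "/kids-clothing/girls/mini-shops/ballet-outfits",
    "/mens/categories/trousers-and-chinos/s2438566/easy-essential-cuffed-jogging-bottoms",
]

kept = [u for u in urls if pattern.match(u)]
print(kept)
```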
Yes, you need a negative lookaround:
/^\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)(?:\/(?:(?!s\d+).)*)+$/gm
If you're comparing one line at a time you don't need the multiline (m) flag. It's probably behaving strangely because you had a character class (denoted by square brackets) nested inside more square brackets, which doesn't work; you can't nest character classes. This was tested and works on refiddle.
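The second pattern can be tried the same way in Python. This sketch assumes the second alternative was meant to be kids-clothing/girls rather than a repeated kids-clothing/boys, and it tests one URL at a time, so no multiline flag is used.

```python
import re

pattern = re.compile(
    r"^/(womens|mens|kids-clothing/boys|kids-clothing/girls|homeware)"
    # Each path character is consumed only if no s<digits> starts at it.
    r"(?:/(?:(?!s\d+).)*)+$"
)

urls = [
    "/homeware/bathroom/s2522424/4-tier-pastel-pop-drawers-approx-91cm-x25cm-x-28cm",
    "/homeware/bathroom/towels-and-bathmats",
    "/kids-clothing/girls/mini-shops/ballet-outfits",
    "/mens/categories/trousers-and-chinos/s2438566/easy-essential-cuffed-jogging-bottoms",
]

kept = [u for u in urls if pattern.match(u)]
print(kept)
```

Unlike the lookahead at the start of the string, (?!s\d+) here blocks an "s" followed by digits anywhere after the prefix, not only right after a slash.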