Can I use negative lookahead and other conditions together in regex group? - regex

I'm trying to match some URLs against another table and, because the original source wasn't put together properly, I'm using a regex to clean them within the SQL.
As an example, the URLs might be /this-is-my-test-string/ or /this-is-my-test-string and the reference table is always of the form /this-is-my-test-string so using this regex works well to capture the matching part.
(\/[^\/)]*)\/?
However I've now come across some others with the form /this-is-my-test-string- and /this-is-my-test-string-/ which aren't as straightforward - I can't just add - to the exclusion as it's present in the rest of the string. From reading around - regex is not something I use regularly - a lookahead would seem to be the answer, but I can't work out how to include this in the expression.
Any help would be gratefully received.

You can use $ to anchor the end of the string, and use a non-greedy quantifier *? on the non-slash character set to allow -? to match a - from (or near) the end of the string:
(\/[^\/)]*?)-?\/?$
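As a quick check of the pattern above, here is a minimal sketch in Python (the choice of Python's `re` module is an assumption; the question uses regex inside SQL, and the helper name `clean` is made up for illustration). It runs the four URL shapes from the question through the pattern and keeps group 1:

```python
import re

# Pattern from the answer: lazy body, optional trailing "-" and "/", anchored at the end.
PATTERN = re.compile(r"(\/[^\/)]*?)-?\/?$")

def clean(url):
    """Return the captured part of the URL, or None if nothing matches."""
    m = PATTERN.search(url)
    return m.group(1) if m else None

samples = [
    "/this-is-my-test-string",
    "/this-is-my-test-string/",
    "/this-is-my-test-string-",
    "/this-is-my-test-string-/",
]
cleaned = [clean(u) for u in samples]
print(cleaned)
```

Because the body is lazy and the pattern is anchored with $, the trailing - and / are left to the optional parts instead of being swallowed by the capture group.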

Related

Regex for "starts with," "does not contain," and "ends with"

I'm trying to search for code within a WordPress site, specifically for a facebook pixel. I'm searching for strings using a regex and I know what the string starts with, ends with, and what the string should NOT contain. I have tried other solutions on SO but with no luck.
The string should start with:
fbq('track'
End with:
);
and NOT contain:
PageView
The expression that I have been playing with to try and do this search is:
^(?=^fbq('track')(?=.*\);$)(?=^(?:(?!PageView).)*$).*$/
From this other StackOverflow question:
Combine Regexp?
However, I keep getting back that this is in an invalid format.
You may use:
^(?!.*PageView)fbq\('track.*\);$
Or:
^fbq\('track(?!.*PageView).*\);$
Breakdown:
^ - Beginning of the string.
(?!.*PageView) - Negative Lookahead (does not contain "PageView" from this point forward).
fbq\('track - Match "fbq('track", literally (notice how "(" is escaped: \().
.* - Match zero or more characters (any characters).
\); - Match ");" literally.
$ - End of string.
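The breakdown above can be tried out with a small sketch. Python's `re` module is assumed here (the question does not say which engine the search tool uses), and the sample lines are invented for illustration:

```python
import re

# First pattern from the answer: reject PageView up front, then match start and end.
PIXEL = re.compile(r"^(?!.*PageView)fbq\('track.*\);$")

lines = [
    "fbq('track', 'Purchase', {value: 10});",  # starts and ends correctly, no PageView
    "fbq('track', 'PageView');",               # contains PageView
    "console.log(fbq('track', 'Lead'));",      # does not start with fbq('track
]
matches = [bool(PIXEL.search(line)) for line in lines]
print(matches)
```

Only the first sample line should be reported as a match.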
You can go with the first one!
I have already tested it in the regex tool that I use to try out regexes when I need to. ;)
I'm going to add my little grain of sand :)
Here is a good source for reading about lookahead and lookbehind (including the negative forms): https://www.regular-expressions.info/lookaround.html
*It contains information about their use and restrictions in the most common regex flavors (and their implementation in some programming languages).
First of all, if you are not able to locate the FB Pixel, check whether the site uses Google Tag Manager; the pixel may be added via GTM.
If not, then on with the RegEx...
As this is a script in a template file where it can span multiple lines and have spaces before the text etc, a more flexible pattern would be appropriate.
So the main idea is that you don't use ^ and $ in your pattern.
Example
fbq\('track'(?!.*?PageView)[^)]*\);
The pattern above satisfies the requirements you outlined in the OP, where
fbq\('track' - Literally matches fbq('track' as the start of the string
(?!.*?PageView) - Negative lookahead that fails if PageView is found; .*? lazily matches 0 or more characters, since PageView, if present, is likely to appear early and we avoid unnecessary backtracking
As the lookahead above is zero-length, if it passes (PageView not found) the cursor will still be at the end of - fbq('track' <- Cursor here
[^)]* - Matches 0 or more characters up to, but excluding, a closing parenthesis
\); - Match ); literally.
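Here is a rough sketch of the unanchored pattern running over a multi-line template. It assumes Python's `re` module rather than the JS flavor mentioned in this answer, and the template text (including the pixel id) is made up for illustration:

```python
import re

# Unanchored pattern from the answer: no ^ or $, so it can match inside a larger file.
PIXEL = re.compile(r"fbq\('track'(?!.*?PageView)[^)]*\);")

# Hypothetical template snippet; the id and event names are invented.
template = """<script>
    fbq('init', '1234567890');
    fbq('track', 'PageView');
    fbq('track', 'AddToCart', {
        value: 9.99
    });
</script>"""

found = PIXEL.findall(template)
for call in found:
    print(call)
```

Note that . does not match a newline by default, so the lookahead only inspects the rest of the current line, while [^)]* happily runs across line breaks until the closing parenthesis.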
I am guessing you might be using VSCode, PhpStorm or similar, so I selected JS as the flavor in the example for compatibility.
If you are using grep, say in Linux or a bash terminal on Windows (not sure about Mac due to grep parameter compatibility), running this from the theme directory should show you the files and matches.
grep -Pzro 'fbq\('\''track'\''(?!.*?PageView)[^)]*\);'

Confusion regarding regex pattern

I have tried to write a regex to catch certain words in a sentence, but it is not working. The regex below only works when I give it an exact match.
[\s]*((delete)|(exec)|(drop\s*table)|(insert)|(shutdown)|(update)|(\bor\b))
Let's say I send an HTTP header headerName = insert: it works,
but it does not work when I give headerName = awesome insert number
--edit--
#user1180, Yes I can use prepared statements, but we are also looking into the regex part.
#Marcel and Wiktor, yes, it works on that website. I guess my tool is not recognizing the regex. I am using Mulesoft ESB, whose matches operator "matches when the evaluated value fits a given regular expression (regex), specifically a regex 'flavor' supported by Java".
It is using something like this,
matches /\+(\d+)\s\((\d+)\)\s(\d+\-\d+)/ and I am not aware of how to write my usecase in this regex format.
My use case is to catch SQL injection patterns, which means checking the request header/query param for the delete, exec, drop\s*table, insert, shutdown, update and or keywords.
Since your regex must match the whole input you need to wrap the pattern with .*, something similar to (?s).*(<YOUR PATTERN>).*.
Use
(?s).*\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b.*
Details
(?s) - turns on DOTALL mode where . matches any char
.* - any 0+ chars, as many as possible
\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b - any one of the whole words (note \b is a word boundary construct) in the group
.* - any 0+ chars, as many as possible
I also replaced drop\s*table with drop\s+table since I guess droptable is not expected.
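To see why the wrapping .* matters, here is a small sketch in Python. `re.fullmatch` is used as a stand-in for a whole-input match (an assumption; the Mulesoft tool itself is not exercised here), and the sample header values are illustrative:

```python
import re

KEYWORDS = r"\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b"

BARE = re.compile(KEYWORDS)                     # keyword group only
WRAPPED = re.compile(r"(?s).*" + KEYWORDS + r".*")  # pattern from the answer

value = "awesome insert number"

# A whole-input match with the bare pattern fails: the keyword is not the entire value.
bare_full = BARE.fullmatch(value) is not None
# Wrapped in .*, the same keyword is found anywhere in the value.
wrapped_full = WRAPPED.fullmatch(value) is not None

print(bare_full, wrapped_full)
```

The \b word boundaries keep values such as "inserted" or "droptable" from being flagged.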

Match a url that does not contain certain word

I need some help with a regular expression that does not match URLs like this one:
/Common/Download.php?file=/path/to/file.pdf
and instead matches static URLs like this one:
/path/to/file.pdf
I have read many posts (also on this site) but nothing seems to work as expected.
Thanks for your help.
Lorenzo.
UPDATE
Sorry if this post is incomplete. Here is more information so I can get better help.
The regular expression must work with Apache's mod_rewrite module (and, if possible, with the IIS rewrite module, which may not be its exact name but is, as far as I know, compatible with Apache's) and must redirect the matching static URLs (only the second type from my post) to a specific page.
Thanks again.
Lorenzo.
Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.
Another option is to put the forward slash followed by lowercase characters in a non-capturing group and repeat that group, then match the file extension .pdf:
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Non-capturing group (?:
Match a forward slash followed by one or more lowercase characters /[a-z]+
Close the non capturing group and match 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static url and maybe longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf you could for example use:
^(?:/\w+)+\.\w+$
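A short sketch of that last pattern against the URLs from the question. Python's `re` module is assumed here purely for testing; mod_rewrite has its own regex engine, so treat this as a sanity check rather than a guarantee:

```python
import re

# Last pattern from the answer: one or more /word segments, then .extension, nothing else.
STATIC = re.compile(r"^(?:/\w+)+\.\w+$")

urls = [
    "/path/to/file.pdf",
    "/path/file.pdf",
    "/dir/path/to/file.pdf",
    "/Common/Download.php?file=/path/to/file.pdf",
]
is_static = [bool(STATIC.search(u)) for u in urls]
print(is_static)
```

The download URL is rejected because the ? after .php is neither a word character nor the end of the string.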

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works, here is an example so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same. But this works with npp.
Notice I escaped the ., since you are trying to match extensions and what you want is the literal .. The way you had it, the . was a wildcard that could be any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
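Python's `re` module has the same fixed-width restriction on lookbehind, so it can serve as a stand-in to illustrate the workaround (an assumption; Notepad++ itself is not being tested here). The sample text is invented:

```python
import re

# A lookbehind whose alternatives differ in length is rejected at compile time.
try:
    re.compile(r"(?<!flecha|flecha1)\.jpg")
    variable_ok = True
except re.error:
    variable_ok = False

# Workaround from the answer: one fixed-length lookbehind per alternative.
FIXED = re.compile(r"(?<!flecha)(?<!flecha1)\.jpg")

text = "flecha.jpg flecha1.jpg casa.jpg moto.jpg"
# The pattern only consumes ".jpg", so recover the name just before each match.
hits = [text[:m.start()].split()[-1] for m in FIXED.finditer(text)]

print(variable_ok, hits)
```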
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
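The three steps of the temporary-marker strategy can be sketched in Python as follows. This is only an illustration under the same assumption that ..jpg does not occur in the text; note that Python's replacement syntax uses \1 where Notepad++ uses $1, and the sample text is made up:

```python
import re

text = "flecha.jpg casa.jpg img_flecha9.jpg moto.jpg"

# Step 1: mark the names to skip by inserting an extra dot before ".jpg".
marked = re.sub(r"(flecha[0-9]*)(\.jpg)", r"\1.\2", text)

# Step 2: work with the remaining .jpg names (here they are simply collected).
others = re.findall(r"\w+\.jpg", marked)

# Step 3: remove the temporary marker again.
restored = marked.replace("..jpg", ".jpg")

print(marked)
print(others)
```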
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
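A minimal sketch of this lookahead approach, run over the file names listed earlier plus two names that should be found. Python's `re` module is assumed as the engine:

```python
import re

# Pattern from this strategy: start of a name, no "flecha<digits>.jpg" inside it, then the name.
OTHER_JPG = re.compile(r"(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg")

text = (
    "flecha.jpg casa.jpg flecha1.jpg flecha00501275696.jpg "
    "aflecha.jpg moto.jpg img_flecha9.jpg abcflecha556677.jpg"
)
kept = OTHER_JPG.findall(text)
print(kept)
```

Because \S* cannot cross whitespace, the lookahead only inspects the file name that starts at the current position.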
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.
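For comparison, a sketch of this whole-filename pattern in Python (the engine is an assumption and the sample text is invented). Unlike the lookbehind versions, the match includes the name itself:

```python
import re

# Pattern from the answer: a word boundary, a check that the name is not flecha or flecha1,
# then the name and the extension.
NAME = re.compile(r"\b(?!flecha1?\b)\w+\.jpg")

text = "flecha.jpg flecha1.jpg casa.jpg moto.jpg"
names = NAME.findall(text)
print(names)
```

As written, the lookahead only excludes the exact names flecha and flecha1; a name such as flecha2.jpg would still be matched.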

Google Analytics Regex - Alternative to no negative lookahead

Google Analytics does not allow negative lookahead anymore within its filters. This is proving to be very difficult to create a custom report only including the links I would like it to include.
The regex that includes negative lookahead that would work if it was enabled is:
test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
This matches:
test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23
test.com/?ref=23&e=35
and does not match (as it should):
test.com/ambassadors
test.com/admin/?signup=true
test.com/randomtext/
I am looking to find out how to adapt my regex to still hold the same matches but without the use of negative lookahead.
Thank you!
Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.
That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you should have been using $ all along.
However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:
test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)
...which I'm pretty sure you don't want. :P
Try this regex:
test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$
or more readably:
test\.com
(?:
/
(?:index_\w+\.php)?
(?:
\?ref=\d+
(?:
&e=\d+
)?
)?
)?
\s*$
For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:
^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
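To check the domain-prefixed, lookahead-free pattern from earlier in this answer against the lists in the question, here is a small sketch. Python's `re` module is assumed as a stand-in for the Google Analytics engine, which may differ in details:

```python
import re

# Lookahead-free pattern from this answer (the version that still includes the domain).
GA = re.compile(r"test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$")

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23",
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
]

matched = [u for u in should_match + should_not_match if GA.search(u)]
print(matched)
```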
Firstly I think your regex needs some fixing. Let's look at what you have:
test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
The case where you use the optional ? at the start of index... is already taken care of by the second alternative:
test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:
test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:
test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now the first and second and third option can be collapsed into one, if we make the file name optional, too:
test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)
Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
Seeing your input examples, this seems to be closer to what you actually want to match.
Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which always matches the end of the string. If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.
So, if you use singleline mode (which probably means you have only one URL per string), use this:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z
If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
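The difference between the two modes can be illustrated with a short sketch. It uses Python's `re` module, which is an assumption: in Python \Z matches only at the very end of the string, while some other flavors also let \Z match before a final line break. The two-line sample text is made up:

```python
import re

text = "test.com/\ntest.com/ambassadors"
newline_at = text.index("\n")
end_at = len(text)

# Without DOTALL, "." cannot match a newline, so (?!.) succeeds at every line end.
line_ends = [m.start() for m in re.finditer(r"(?!.)", text)]
# With DOTALL (single-line mode), "." matches newlines too, so (?!.) succeeds only at the end.
string_end = [m.start() for m in re.finditer(r"(?s)(?!.)", text)]

# The plain anchors that behave the same way in each case.
multiline_dollar = [m.start() for m in re.finditer(r"(?m)$", text)]
absolute_end = [m.start() for m in re.finditer(r"\Z", text)]

print(line_ends, string_end, multiline_dollar, absolute_end)
```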