Regex alternative to negative lookahead - regex

I want to match all paths that include the keyword build unless they also contain .html
Here is a working regex that uses negative lookahead: https://regexr.com/4msck
I am using regex for path matching in unison which does not support negative lookahead. How can I replicate the functionality of the above regex without negative lookahead?

It is possible, but the resulting regex is pretty poor in terms of readability and maintainability.
http://regexr.com/4mst1
^(?:[^\.\n]|\.(?:$|[^h\n]|h(?:$|[^t\n]|t(?:$|[^m\n]|m(?:$|[^l\n])))))*build(?:[^\.\n]|\.(?:$|[^h\n]|h(?:$|[^t\n]|t(?:$|[^m\n]|m(?:$|[^l\n])))))*$
Explanation:
^ - start of string/line
(?:[^\.\n]|\.(?:$|[^h\n]|h(?:$|[^t\n]|t(?:$|[^m\n]|m(?:$|[^l\n])))))* - matches anything that does not contain .html
build - literally that string
(?:[^\.\n]|\.(?:$|[^h\n]|h(?:$|[^t\n]|t(?:$|[^m\n]|m(?:$|[^l\n])))))* - same as before
$ - end of string/line
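To sanity-check the explanation above, here is a minimal Python sketch that compares the long regex with a lookahead version on a few sample paths. The lookahead pattern `^(?!.*\.html).*build.*$` and the sample paths are assumptions (the regex behind the regexr link is not reproduced in the question), and Python's `re` module only stands in for whatever engine unison uses.

```python
import re

# Assumed lookahead version; the pattern behind the regexr link is not shown.
with_lookahead = re.compile(r"^(?!.*\.html).*build.*$")

# "Anything that does not contain .html", copied from the answer above.
not_html = r"(?:[^\.\n]|\.(?:$|[^h\n]|h(?:$|[^t\n]|t(?:$|[^m\n]|m(?:$|[^l\n])))))*"
without_lookahead = re.compile("^" + not_html + "build" + not_html + "$")

# Hypothetical sample paths.
paths = [
    "src/build/main.js",
    "build/index.html",
    "build/page.htm",
    "app.html/build",
    "docs/readme.md",
]

# For each path: (matched by lookahead version, matched by long version)
results = {
    p: (bool(with_lookahead.search(p)), bool(without_lookahead.search(p)))
    for p in paths
}
for path, (a, b) in results.items():
    print(path, a, b)
```

On these samples the two patterns agree. They are not equivalent in every corner case, though: a dot immediately followed by build (as in a.build) is matched by the lookahead version but not by the long regex, because the branch that handles a dot also consumes the character after it.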

According to the manual, this should work. It is based on the comment: "I want to ignore all files in a build directory except for html files"
ignore = Regex .*build.*
ignorenot = Name {*.html}
I am not familiar with unison, so I must assume that you can specify the paths with more than 1 rule.
I have this expectation because of this statement in the manual:
There is also an ignorenot preference, which specifies a set of patterns for paths that should not be ignored, even if they match an ignore pattern.

Related

Regex for "starts with," "does not contain," and "ends with"

I'm trying to search for code within a WordPress site, specifically for a facebook pixel. I'm searching for strings using a regex and I know what the string starts with, ends with, and what the string should NOT contain. I have tried other solutions on SO but with no luck.
The string should start with:
fbq('track'
End with:
);
and NOT contain:
PageView
The expression that I have been playing with to try and do this search is:
^(?=^fbq('track')(?=.*\);$)(?=^(?:(?!PageView).)*$).*$/
From this other StackOverflow question:
Combine Regexp?
However, I keep getting back that this is in an invalid format.
You may use:
^(?!.*PageView)fbq\('track.*\);$
Or:
^fbq\('track(?!.*PageView).*\);$
Demo.
Breakdown:
^ - Beginning of the string.
(?!.*PageView) - Negative Lookahead (does not contain "PageView" from this point forward).
fbq\('track - Match "fbq('track", literally (notice how "(" is escaped: \().
.* - Match zero or more characters (any characters).
\); - Match ");" literally.
$ - End of string.
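As a quick check of the breakdown above, here is a small Python sketch that runs the first pattern against three sample lines. The sample lines (including the pixel ID) are made up for illustration.

```python
import re

# First pattern from the answer: the lookahead runs once, at the start.
pattern = re.compile(r"^(?!.*PageView)fbq\('track.*\);$")

# Hypothetical sample lines.
lines = [
    "fbq('track', 'Purchase');",
    "fbq('track', 'PageView');",
    "fbq('init', '1234567890');",
]

matched = [line for line in lines if pattern.search(line)]
print(matched)
```

Only the Purchase line is kept: the PageView line is rejected by the lookahead, and the init line does not start with fbq('track.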
You can go with the first one!
I have already tested it in the regex tool I use to try out regexes when I need to. ;)
I'm going to add my little grain of sand :)
Here is a good source to read about lookahead and lookbehind (including the negative variants): https://www.regular-expressions.info/lookaround.html
*It contains information about their use and restrictions in the most common regex flavors (and their implementation in some programming languages).
First of all, if you are not able to locate the FB Pixel, check whether the site uses Google Tag Manager; the pixel may have been added via GTM.
If not, then on with the RegEx...
As this is a script in a template file where it can span multiple lines and have spaces before the text etc, a more flexible pattern would be appropriate.
So the main idea is that you don't use ^ and $ in your pattern.
Example
fbq\('track'(?!.*?PageView)[^)]*\);
The pattern above satisfies the requirements you outlined in the OP, where
fbq\('track' - Literally matches fbq('track' as the start of the string
(?!.*?PageView) - Negative lookahead that fails if PageView is found. .*? lazily matches 0 or more characters, since PageView is likely to appear sooner rather than later and this avoids unnecessary backtracking
As the lookahead above is 0 length, if it passes (PageView not found) the cursor will still be at the end of fbq('track' <- Cursor here
[^)]* - Matches 0 or more characters up to, but not including, the next closing parenthesis
\); - Match ); literally.
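The unanchored pattern can be tried on a multi-line snippet with the Python sketch below. The template text is a made-up example, and Python's `re` is used instead of the JS flavor mentioned in the answer; the pattern itself is unchanged.

```python
import re

# Unanchored pattern from the answer above.
pattern = re.compile(r"fbq\('track'(?!.*?PageView)[^)]*\);")

# Hypothetical template fragment; the last call spans two lines.
template = """<script>
  fbq('init', '1234567890');
  fbq('track', 'PageView');
  fbq('track', 'Lead',
      {value: 10});
</script>
"""

matches = pattern.findall(template)
for m in matches:
    print(repr(m))
```

Because [^)]* also matches newlines, the Lead call is found even though it is split across two lines. Note that . does not match a newline by default, so the lookahead only inspects the rest of the line on which fbq('track' appears.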
I am guessing you might be using VSCode, PhpStorm or similar, so I selected JS as the flavor in the example for compatibility.
If you are using grep, say on Linux or in a bash terminal on Windows (not sure about Mac due to grep parameter compatibility), running this from the theme directory should show you the files and matches.
grep -Pzro 'fbq\('\''track'\''(?!.*?PageView)[^)]*\);'

Match specific pattern that does not contain other pattern in one expression

I'm looking for a regex to use in nginx location matching, that would match a specified end pattern not being anywhere preceded by a specified other pattern.
Like, I have files:
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb
I want to match all \.unityweb except those that are anywhere preceded by dev. Basically, I need to match last two lines. I cannot hardcode it, as the files/directories might be named arbitrary.
The usual ((?!dev\/).)*$ doesn't suffice, because it still matches the ends. (?<!dev) also cannot be added anywhere, as it will only match directly before.
I am out of clues and also out of regex fu!
The solution does not have to be strictly regex, might be nginx based too.
It might have been asked before, but I cannot seem to know the correct keywords to find it.
Try
^(?!.*?dev\/.*).+\.unityweb$
See the demo here
Description:
^ From the start of the line
(?! _______ ) Negative Lookahead
.*?dev\/ Match any character any amount of times, until you reach dev followed by a slash
.* Match any characters any amount of times
Negative lookahead closes
.+ Match any character, more than once
\.unityweb - until you reach .unityweb
$ End of the line
Use the full match for what you need
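The description above can be checked against the four file names from the question with a short Python sketch (Python's `re` is an assumption here; nginx uses PCRE, but this pattern behaves the same in both).

```python
import re

# Pattern from the answer: reject any line that contains "dev/".
pattern = re.compile(r"^(?!.*?dev\/.*).+\.unityweb$")

# File names taken from the question.
files = [
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb",
]

matched = [f for f in files if pattern.match(f)]
for f in matched:
    print(f)
```

Only the last two names are printed, which is what the question asks for.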
EDIT
Just realised that you also state a contradiction in your question, as you say you don't want to match anything preceded by dev/ but you also want to match the first two examples you gave.
That can be done by changing the negative lookahead to a positive lookahead:
^(?=.*?dev\/.*).+\.unityweb$
See the demo here
You can use this
^(?!.*dev.*\.unityweb)(?=.*\.unityweb).*$
Demo
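The two suggested negative-lookahead patterns are not interchangeable: the first rejects only names containing dev followed by a slash, while this one rejects dev anywhere before .unityweb. A minimal Python sketch with made-up file names shows the difference.

```python
import re

dev_slash = re.compile(r"^(?!.*?dev\/.*).+\.unityweb$")
dev_anywhere = re.compile(r"^(?!.*dev.*\.unityweb)(?=.*\.unityweb).*$")

# Hypothetical file names.
names = [
    "webgl-gzip/app.data.unityweb",
    "webgl-gzip-dev/app.data.unityweb",
    "devices/app.data.unityweb",
]

# For each name: (matched by dev_slash, matched by dev_anywhere)
comparison = {
    n: (bool(dev_slash.match(n)), bool(dev_anywhere.match(n))) for n in names
}
for name, flags in comparison.items():
    print(name, flags)
```

The patterns disagree only on devices/app.data.unityweb, so the choice depends on whether a directory that merely starts with dev should be excluded.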

regex to ignore lines with specific symbol

I'm trying to write a regex to change relative paths to absolute ones with an alias, and to ignore any lines where the alias is already indicated with the # symbol, to prevent folder/file name matches. I've got as far as the match and replace, but I can't get the part that ignores lines with # to work. I would also like it to match the forward slashes on either side of /foldername/ when selecting.
https://regex101.com/r/vRUegE/1/
I would have expected the lines with # to be ignored
Here is the correct response thanks to Wiktor:
Working example
Using a combination of WinGrep and these regexes it was easy to refactor hundreds of paths in hundreds of files in minutes!
You may add a negative lookahead (?!#) after the positive lookbehind:
(?<=from..)(?!#)(.*)(?=module)(.*)(module)
^^^^^
See this regex demo. The (?!#) will fail the match once # is found right after from and any 2 chars immediately to the right of it.
Note that the regex might need further adjusting as (?=module) does not make much sense here. You might as well use (?<=from..)(?!#)(.*)(module).
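Here is a small Python sketch of the simplified pattern. The two import lines are hypothetical, since the input behind the regex101 link is not reproduced in the question.

```python
import re

# Simplified pattern from the note above.
pattern = re.compile(r"(?<=from..)(?!#)(.*)(module)")

# Hypothetical input lines: one relative path, one already aliased with #.
relative = "import a from '../module/a'"
aliased = "import b from '#/module/b'"

relative_match = pattern.search(relative)
aliased_match = pattern.search(aliased)

print(relative_match.group(0) if relative_match else None)
print(aliased_match.group(0) if aliased_match else None)
```

The lookbehind consumes nothing, so the match on the relative line starts right after the opening quote, and the aliased line is skipped because # is the next character at that position.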
It is difficult to say which language you are using (it looks like Java), and regex syntax differs a bit between languages.
(?<=from..)(.*)(?=module)(.*)(module)/g
Here (.*) matches any sequence of characters. You need to change it to
([^\#]*) (or ([^#]*))
i.e. any sequence of non-# characters.

.hgignore a folder except some subfolders

I want to ignore a folder but preserve some of its folders.
I tried regexp matching like this
syntax: regexp
^site/customer/\b(?!.*/data/.*).*
Unfortunately this doesn't work.
I read in this answer that python only does fixed-width negative lookups.
Is my desired ignoring impossible?
Python regex is cool
Python does support negative lookaheads such as (?!.*foo). But it doesn't support arbitrary-length lookbehinds such as (?<!foo.*). A lookbehind needs to be fixed-width, e.g. (?<!foo..).
Which means it's definitely possible to solve your problem.
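The fixed-width restriction is easy to confirm with Python's `re` module. This is only a sketch; the patterns are toy examples, not the ones from the question.

```python
import re

# Arbitrary-length negative lookahead: accepted.
re.compile(r"(?!.*foo)bar")

# Fixed-width negative lookbehind: accepted.
re.compile(r"(?<!foo..)bar")

# Variable-width lookbehind: rejected at compile time.
try:
    re.compile(r"(?<!foo.*)bar")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)
```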
The problem
You've got the following regex: /customer/(?!.*/data/.*).*.
Let's take an input example /customer/data/name. It matches for a reason.
/customer/data/name
^^^^^^^^^^ -> /customer/ match !
          ^ (?!.*/data/.*) Let's check if there is no /data/ ahead
            The problem is here, we've already matched "/"
            so the regex only finds "data/name" instead of "/data/name"
          ^^^^^^^^^ .* match !
Fixing your regex
Basically, we just need to remove that one forward slash. We also add an anchor ^ to make sure the match starts at the beginning of the string, and use \b to make sure we match exactly customer: ^/customer\b(?!.*/data/).*.
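A minimal Python sketch comparing the original and the fixed regex on two paths. The second path is a made-up example of a folder that should still be ignored.

```python
import re

original = re.compile(r"/customer/(?!.*/data/.*).*")
fixed = re.compile(r"^/customer\b(?!.*/data/).*")

# "/customer/data/name" is from the answer; the other path is hypothetical.
paths = ["/customer/data/name", "/customer/other/name"]

# For each path: (matched by original, matched by fixed)
outcome = {
    p: (bool(original.search(p)), bool(fixed.search(p))) for p in paths
}
for path, flags in outcome.items():
    print(path, flags)
```

The original regex matches both paths, so the data folder is ignored too. The fixed one matches only /customer/other/name, leaving /customer/data/name alone.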
Online demo

Google Analytics Regex - Alternative to no negative lookahead

Google Analytics does not allow negative lookahead anymore within its filters. This is proving to be very difficult to create a custom report only including the links I would like it to include.
The regex that includes negative lookahead that would work if it was enabled is:
test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
This matches:
test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23
test.com/?ref=23&e=35
and does not match (as it should):
test.com/ambassadors
test.com/admin/?signup=true
test.com/randomtext/
I am looking to find out how to adapt my regex to still hold the same matches but without the use of negative lookahead.
Thank you!
Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.
That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you should have been using $ all along.
However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:
test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)
...which I'm pretty sure you don't want. :P
Try this regex:
test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$
or more readably:
test\.com
(?:
/
(?:index_\w+\.php)?
(?:
\?ref=\d+
(?:
&e=\d+
)?
)?
)?
\s*$
For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:
^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
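The first suggested regex (the one that still includes test\.com) can be checked against the URLs from the question with a short Python sketch. Python's `re` is an assumption here; Google Analytics may differ in details.

```python
import re

pattern = re.compile(
    r"test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$"
)

# URLs listed in the question.
should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23",
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
    # One of the unwanted strings the original regex accepted.
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
]

accepted = [u for u in should_match if pattern.search(u)]
rejected = [u for u in should_not_match if not pattern.search(u)]
print(len(accepted), "accepted,", len(rejected), "rejected")
```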
Firstly I think your regex needs some fixing. Let's look at what you have:
test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
The case where you use the optional ? at the start of index... is already taken care of by the second alternative:
test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)
Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:
test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:
test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
Now the first and second and third option can be collapsed into one, if we make the file name optional, too:
test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)
Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
Seeing your input examples, this seems to be closer to what you actually want to match.
Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which always matches the end of the string. If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.
So, if you use singleline mode (which probably means you have only one URL per string), use this:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z
If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:
test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
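The difference between the two anchors can be sketched in Python, where \Z matches only at the very end of the string and $ with re.MULTILINE also matches at the end of each line. The multi-line input is a made-up example built from URLs in the question.

```python
import re

body = r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*"

whole_string = re.compile(body + r"\Z")          # one URL per string
per_line = re.compile(body + "$", re.MULTILINE)  # one URL per line

# Hypothetical input with several URLs on their own lines.
many = "test.com/\ntest.com/ambassadors\ntest.com/?ref=23"

end_only = [m.group(0) for m in whole_string.finditer(many)]
each_line = [m.group(0) for m in per_line.finditer(many)]

print(end_only)
print(each_line)
```

With \Z only the URL on the last line can match, because nothing else ends at the end of the string. With $ in multiline mode each line is judged on its own, and test.com/ambassadors is rejected either way.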