Test but not select with regex

I am trying to test for an expression but not to select it.
I need that for selecting custom TODOs in the PyCharm IDE.
I want to select comments that have the word to-cleanup in them.
When I do the following: # \b.*to-cleanup\b.* it also selects the #. I'm pretty sure there must be a way to test for the existence of # but not to select it.
I have only read the regex documentation in the PyCharm Help, so I don't know how to do it. Any help would be greatly appreciated!
I checked here, but couldn't understand how to fit it into what I need.

You can use a positive lookbehind:
(?<=#) .*\bto-cleanup\b.*
The regex matches (see demo):
(?<=#) followed by a space - a space preceded with a # symbol
.* - 0 or more characters other than a newline, as many as possible, up to the last occurrence of the next subpattern
\bto-cleanup\b - the whole word to-cleanup
.* - 0 or more characters other than a newline (up to the end of the line).
This lookbehind is fixed-width and only checks if the space is preceded with # while the # itself is not part of the match.
See lookarounds details at regular-expressions.info
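The behaviour of the lookbehind can be checked outside the IDE. Below is a minimal sketch using Python's re module (an assumption: PyCharm uses its own regex engine, so this only illustrates the pattern itself). The helper name todo_text is made up for the example.

```python
import re

# Pattern from the answer: a space that is preceded by '#', followed by the
# rest of the line, provided the line contains the whole word "to-cleanup".
PATTERN = re.compile(r"(?<=#) .*\bto-cleanup\b.*")


def todo_text(line):
    """Return the matched text (without the '#'), or None if nothing matches."""
    m = PATTERN.search(line)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in ("x = 1  # to-cleanup: remove this",
                   "# nothing to see here",
                   "#to-cleanup without a space"):
        print(repr(sample), "->", repr(todo_text(sample)))
```

The '#' is tested by the lookbehind but never becomes part of the match, so the returned text starts at the space after it.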

Related

Regex - Exclude URLs by keywords

I am trying to use StackPath's EdgeRules and their documentation is not very clear or good.
I need to match urls in multiple directories but exclude any URL's that have the extension m3u8 in it or the word segment in it. This is their docs EdgeRules
This works to limit it to 2 directories.
/(https://example.com(/(pics|vids)/).*)/
But then this doesn't work.
/(https://example.com(/(pix|vids)/).+(?!m3u8|segment).*)/
I've been trying to use https://regex101.com/ but nothing I try seems to work. I don't even know what kind of regex they use. Hopefully can get some help with this.
I can't test this, so apologies if something else is wrong...
The negative lookaheads need to be side by side, not wrapped in parentheses and separated by an or (|). I also added an end-of-line anchor ($) after .m3u8.
(https://example.com(/(pix|vids)/)(?!.*\.m3u8$)(?!.*segment.*).*)
See this example:
https://regex101.com/r/reVHWt/1
The EdgeRules docs do not mention the regex flavor they support, and it is not clear from the examples. Also, the example /(^http://example.com(/.*/)+.$)/ shows unescaped forward slashes inside the /.../ delimiters, indicating this is non-standard regex.
I see no other way than using a negative lookahead to exclude arbitrary patterns. Assuming their regex does support it you can try:
/^https://example.com/(pix|vids)/(?!.*\bm3u8\b)(?!.*\bsegment\b).*$/
Or with properly escaped special chars:
/^https:\/\/example\.com\/(pix|vids)\/(?!.*\bm3u8\b)(?!.*\bsegment\b).*$/
Explanation of regex:
^ -- anchor at start of string
https:\/\/example\.com\/ -- literal https://example.com/
(pix|vids) -- literal pix or vids
\/ -- slash
(?!.*\bm3u8\b) -- negative lookahead for m3u8, anchored on both sides with \b
(?!.*\bsegment\b) -- ditto for segment
.*$ -- any other chars up to end of string
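Since the EdgeRules regex flavor is unknown, the two side-by-side negative lookaheads can at least be sanity-checked in another engine. This is a small sketch in Python's re module, assuming lookaheads are supported by the target system; the function name is_allowed and the sample URLs are invented for the illustration.

```python
import re

# Same idea as the answer: restrict to /pix/ or /vids/, then reject any URL
# whose remainder contains the whole word "m3u8" or the whole word "segment".
URL_RULE = re.compile(
    r"^https://example\.com/(pix|vids)/(?!.*\bm3u8\b)(?!.*\bsegment\b).*$"
)


def is_allowed(url):
    """True if the URL is in an allowed directory and hits no excluded keyword."""
    return URL_RULE.match(url) is not None


if __name__ == "__main__":
    samples = [
        "https://example.com/pix/cat.jpg",
        "https://example.com/vids/stream/playlist.m3u8",
        "https://example.com/vids/segment/001.ts",
        "https://example.com/docs/readme.txt",
    ]
    for url in samples:
        print(url, "->", is_allowed(url))
```

Because of the \b anchors, a name such as segment_001.ts is not excluded (the underscore is a word character); drop the \b if any occurrence of the substring should be rejected.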

Regex Email validation with some special cases [duplicate]

I am trying to make a regex match which is discarding the lookahead completely.
\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*
This is the match and this is my regex101 test.
But when an email starts with - or _ or . it should not match it completely, not just remove the initial symbols. Any ideas are welcome, I've been searching for the past half an hour, but can't figure out how to drop the entire email when it starts with those symbols.
You can use the word boundary near # with a positive lookbehind to check if we are at the beginning of a string or right after a whitespace, then check that the 1st symbol is not one of the unwanted characters, using the negated class [^\s\-_.]:
(?<=^|\s)[^\s\-_.]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
See demo
List of matches:
support#github.com
s.miller#mit.edu
j.hopking#york.ac.uk
steve.parker#soft.de
info#company-hotels.org
kiki#hotmail.co.uk
no-reply#github.com
s.peterson#mail.uu.net
info-bg#software-software.software.academy
Additional notes on usage and alternative notation
Note that it is best practice to use as few escaped chars as possible in the regex, so, the [^\s\-_.] can be written as [^\s_.-], with the hyphen at the end of the character class still denoting a literal hyphen, not a range. Also, if you plan to use the pattern in other regex engines, you might find difficulties with the alternation in the lookbehind, and then you can replace (?<=\s|^) with the equivalent (?<!\S). See this regex:
(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
And last but not least, if you need to use it in JavaScript or other languages not supporting lookbehinds, replace the (?<!\S)/(?<=\s|^) with a (non)capturing group (\s|^), wrap the whole email pattern part with another set of capturing parentheses and use the language means to grab the Group 2 contents:
(\s|^)([^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)
See the regex demo.
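To see the effect of the (?<!\S) variant, here is a short sketch in Python (whose re module rejects the (?<=^|\s) form because the alternatives have different widths). It keeps the # that this thread uses in place of the usual at sign; the sample text is made up.

```python
import re

# The (?<!\S) version from the answer: the address must start at the beginning
# of the string or right after whitespace, and must not start with '_', '.' or '-'.
EMAIL = re.compile(
    r"(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b#\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*"
)

text = (
    "support#github.com -bad#mail.com s.miller#mit.edu "
    "_x#y.org .dot#site.net no-reply#github.com"
)

# The pattern has no capturing groups, so findall returns whole matches.
found = EMAIL.findall(text)

if __name__ == "__main__":
    print(found)
```

Addresses starting with -, _ or . are dropped entirely rather than being matched from their second character, because the lookbehind fails at every later position inside the same token.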
I use this for multiple email addresses, separated by ';':
([A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4};)*
For a single mail:
[A-Za-z0-9._%-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

Regex lookahead. Find word without .min. in string

I'm trying to replace a link in a html file with regex and nodejs. I want to replace links without a .min.js extension.
For example, it should match "common.js" but not "common.min.js"
Here's what I've tried:
let htmlOutput = html.replace(/common\.(?!min)*js/g, common.name);
I think this negative lookahead should work but it doesn't match anything. Any help would be appreciated.
The (?!min)*js part is corrupt: you should not quantify zero-width assertions like lookaheads (they do not consume text so quantifiers after them are treated either as user errors or are ignored). Since js does not start with min this lookahead even without a quantifier is redundant.
If you want to match a string that contains the whole word common, followed by any chars, and ends with .js but not with .min.js, you need
/\bcommon\b(?!.*\.min\.js$).*\.js$/
See the regex demo.
Details:
\b - word boundary
common - a substring
\b - word boundary
(?!.*\.min\.js$) - immediately to the right, there should not be any 0 or more chars followed with .min.js at the end of the string
.* - any 0 or more chars
\.js - a .js substring
$ - end of string.
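The breakdown above can be tried out quickly. This is a minimal sketch in Python rather than Node.js (the lookahead syntax used here is the same in both); the helper name is_non_minified is hypothetical.

```python
import re

# Whole word "common", no ".min.js" at the end of the string, and the string
# must still end with ".js".
NON_MIN_JS = re.compile(r"\bcommon\b(?!.*\.min\.js$).*\.js$")


def is_non_minified(name):
    """True for common*.js files that are not *.min.js."""
    return NON_MIN_JS.search(name) is not None


if __name__ == "__main__":
    for name in ("common.js", "common.min.js",
                 "~/dist/common.9cf5748e0e7fc2928a07.js", "common.css"):
        print(name, "->", is_non_minified(name))
```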
Here, we can likely use a simpler expression that allows any chars except . after the word common, followed by .js:
common([^\.]+)?\.js
Demo
The end regex I'm using is /\bcommon[^min]+js\b/g
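This will find the word common followed by one or more characters and ending in js, while rejecting common.min.js. Note that [^min] is a character class that excludes the single letters m, i and n, not the word min, so a name such as common.main.js is rejected as well. It allows me to replace scripts on my HTML page like: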
script src="~/dist/common.js"
OR
script src="~/dist/common.9cf5748e0e7fc2928a07.js"
Thanks to Wiktor Stribiżew for helping me.
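For comparison, the sketch below (in Python, with made-up file names) runs the character-class regex next to the lookahead version from the earlier answer. It is only an illustration of how the two differ, not a statement about which one fits a given build setup.

```python
import re

# [^min] excludes the individual letters m, i and n; it does not mean "not the word min".
char_class = re.compile(r"\bcommon[^min]+js\b")
# The lookahead version rejects only names that end with ".min.js".
lookahead = re.compile(r"\bcommon\b(?!.*\.min\.js$).*\.js$")

names = [
    "common.js",
    "common.min.js",
    "common.main.js",  # contains m, i and n, but is not a .min.js file
    "common.9cf5748e0e7fc2928a07.js",
]

# Map each name to (matched by character class, matched by lookahead).
results = {
    name: (char_class.search(name) is not None, lookahead.search(name) is not None)
    for name in names
}

if __name__ == "__main__":
    for name, (by_class, by_lookahead) in results.items():
        print(f"{name}: char class={by_class}, lookahead={by_lookahead}")
```

Both patterns agree on the hashed file names shown in the question, since hex hashes never contain m, i or n; they differ only on names that happen to include one of those letters.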

How to end a string with $ directly after .* with a RegEx?

I'm trying to report on a set of URLs that catches all potential URL parameters and I'm having an issue defining the RegEx properly.
We have this RegEx to capture a few variations of our URLs to feed into our reporting but I need to be able to end the string with a $ but when I do, it doesn't show any results.
The RegEx:
/join/$|/join/\?product.*|/join/\.*
For another account, we only use one variation which is outlined below (which works):
^/join/$
I believe the issue is in that after \?product.*, I'm not ending the string (or even starting it).
So far I have tried: ^/join/$|(^[/join/\?product.*]$)|(^[/join/\.*]$) with no luck.
If you want to match the dollar sign literally you have to escape it \$ or else it would mean an anchor to assert the end of the string / line.
This pattern ^/join/$ would therefore only match /join/
In your pattern you use an alternation where the last part /join/\.* would match /join/ but also /join/..... because when you escape the dot you will match it literally and the * quantifier repeats 0+ times.
Perhaps you are looking for:
^/join/(?:\?product.*\$)?$
This will match /join/ followed by an optional part (?:\?product.*\$)? that will match ?product, followed by any char 0+ times and will end on $.
Regex demo
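The difference between $ as an anchor and \$ as a literal character can be sketched as follows. This uses Python's re module, which is an assumption: the reporting tool in the question may use a different regex flavor. The sample paths are invented.

```python
import re

# Unescaped $ is an anchor: the string must end right after "/join/".
ANCHORED = re.compile(r"^/join/$")

# Suggested pattern: "/join/" optionally followed by "?product", any chars,
# and a literal dollar sign (\$), then the end of the string.
WITH_LITERAL_DOLLAR = re.compile(r"^/join/(?:\?product.*\$)?$")

# Last alternative from the question: \. is a literal dot repeated 0+ times.
LITERAL_DOTS = re.compile(r"/join/\.*")

if __name__ == "__main__":
    for path in ("/join/", "/join/$", "/join/?product=abc$", "/join/?product=abc"):
        print(path,
              "| anchored:", bool(ANCHORED.match(path)),
              "| literal $:", bool(WITH_LITERAL_DOLLAR.match(path)))
    print("/join/.... fully matched by /join/\\.*:",
          bool(LITERAL_DOTS.fullmatch("/join/....")))
```

If the URLs do not actually end in a dollar character, dropping the \$ and keeping only the final anchor is probably closer to what the report needs.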
Make the pattern lazy, and since $ is a special character in regex, you need to escape it. (Regarding the escaping, Google Analytics may follow different rules.) Also be careful with []: it defines a character class that matches a single character from a set or range, whereas I think you are trying to capture a group.
\?product.*?\$

Regex - returning a match without a period

I'm using the below regex string to match the word "kohls" which is located in a group of other words.
\W*((?i)kohls(?-i))\W*
It works great when the word is alone, but if the word is in a url, the match includes a period on both sides.
See the below examples:
Thank you for shopping at Kohls - returns a match for kohls.
https://www.kohls.com - returns a match for .kohls.
Edit. https://www.KohlsAndMichaels.com - doesn't return any match for kohls.
I want it to only extract the exact match for kohls without periods or any other symbols/text in front or behind it. Can you tell me what I'm doing wrong?
In cases like that you can always use a site like regex101.com, which explains the regular expression and highlights the matches. In your current expression, the problem with the dots is the \W*, which matches any run of non-word characters, including the periods around kohls. To fix this, you can use the following regular expression:
\b((?i)kohls(?-i))\b
The \b (before and after the word you want to match) asserts the position at a word boundary, so the surrounding periods are no longer part of the match.
If you still have questions, look at the explanation of the regular expression provided by that website. It is worth looking.
The \W metacharacter matches non-word characters, so adding a star matches 0 or more of them (such as periods). Did you mean to use a word boundary instead?
\b(?i)kohls(?-i)\b
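A quick way to compare \W* with \b is the sketch below, written in Python. One assumption to flag: Python's re module does not take the (?i)...(?-i) toggles in the middle of a pattern, so the example uses the scoped form (?i:...) instead, which should behave the same for this case. The helper name extract is made up.

```python
import re

# Word-boundary version: matches only the word itself, case-insensitively.
WORD = re.compile(r"\b(?i:kohls)\b")
# Original idea: \W* also consumes neighbouring non-word characters.
OLD = re.compile(r"\W*((?i:kohls))\W*")


def extract(text):
    """Return the whole match of the \\b version, or None."""
    m = WORD.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    url = "https://www.kohls.com"
    print(repr(OLD.search(url).group(0)))   # whole match includes the periods
    print(repr(extract(url)))
    print(repr(extract("Thank you for shopping at Kohls")))
    print(repr(extract("https://www.KohlsAndMichaels.com")))
```

With \b, KohlsAndMichaels yields no match because there is no word boundary between the s and the A; remove the trailing \b if such names should match too.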
Replace both \W* with [\W,\.\-]* etc.
Should be enough.