Regex expression for parsing URLs my way - regex

I have a question about how to parse URLs my way.
Here's my regex:
[^\s]+?\.(com|net|org|edu...ALL_DOMAIN_EXTENSIONS)([^\s\w\d][^\s]{1,})?
My rationale is that I want to accept
mail.google.com (as long as there's a .com, .net etc)
However, the .com must be followed by a symbol (if by anything) and not by an alphanumeric character. Checked this way, though, this URL will fail:
www.company.com
However, I can't use a greedy repetition to search for the .com because of cases like this:
developer.google.com/appid=com.company.apppackage
How do I check for the first occurrence of a '.com' that is not followed by an alphanumeric character, while keeping whatever follows optional in case the input is just
Google.com

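Use $ as an alternative to match the end of the string, and drop the trailing ? so that the group is required; otherwise a search can stop at www.com inside www.company.com.
[^\s]+?\.(com|net|org|edu...ALL_DOMAIN_EXTENSIONS)([^\s\w\d][^\s]+|$)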
BTW, trying to match all top-level domains will drive you crazy, since anyone can now register a TLD, so they change frequently.
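A minimal Python sketch of the $ alternative, assuming only the four extensions spelled out in the question (the full list is elided there) and assuming each input is a single token rather than running text. The trailing group is written without a ? here, because an optional group would let the match stop at www.com inside www.company.com. The name split_at_tld is made up for illustration.

```python
import re

# Only the extensions listed in the question; the real list is elided there.
TLDS = "com|net|org|edu"

# Group 1: lazy host part ending in ".<extension>".
# Group 2: either a symbol followed by more non-space characters,
#          or nothing at all because the string ends here.
URL_RE = re.compile(r"([^\s]+?\.(?:" + TLDS + r"))([^\s\w\d][^\s]+|$)")


def split_at_tld(url):
    """Return (host, rest) split after the first qualifying extension, or None."""
    m = URL_RE.match(url)
    if m is None:
        return None
    return m.group(1), m.group(2)


for candidate in ("Google.com", "www.company.com",
                  "developer.google.com/appid=com.company.apppackage"):
    print(candidate, "->", split_at_tld(candidate))
```

Because the host part is lazy and the trailing group is mandatory, the engine keeps extending the host until it reaches an extension that is followed by a symbol or by the end of the string.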

Related

Using PCRE2 regex with repeating groups to find email addresses

I need to find all email addresses made up of an arbitrary number of alphanumeric words separated by periods. To test the regex, I'm using the website https://regex101.com/.
The structure of a valid email address is word1.word2.wordN#word1.word2.wordN.word.
The regex /[a-zA-Z0-9.]+#[a-zA-Z0-9.]+.[a-zA-Z0-9]+/gm finds all email addresses included in the document string, but also includes invalid addresses like ........#....com, if present.
I tried to group the repeating parts by using round brackets and a Kleene star, but that causes the regex engine to collapse.
Invalid regex:
/([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+#([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+.[a-zA-Z0-9]+/gm
Although there are many posts about regex groups, I was unable to find an explanation of why the regex engine fails. It seems that the engine gets stuck while trying to find a match.
How can I avoid this problem, and what is the correct solution?
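I think the main issue that caused you trouble is:
. (outside of []) matches any character; you probably meant \. instead (which matches only a literal dot character).
Also, there is no need to make it optional with ?, because the non-dot part of your regex will just match the alphanumeric characters anyway. With the dot both unescaped and optional, ([a-zA-Z0-9]+.?)* can split the same run of characters in a huge number of ways, so the engine backtracks through all of them whenever a match attempt fails, which is why it appears to get stuck.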
I also reduced the right part (x*x is the same as x+), added a case-insensitive flag and ended up with this:
/([a-z0-9]+\.)*[a-z0-9]+#([a-z0-9]+\.)+[a-z0-9]+/gmi
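As a rough check of the corrected pattern, here is a small Python sketch. It keeps the # separator used in the question and uses non-capturing groups so that each match is returned whole; the function name find_addresses and the sample text are made up for illustration.

```python
import re

# Corrected pattern from the answer; "#" is the separator used in the question.
# re.IGNORECASE plays the role of the "i" flag, and finditer plays the role of "g".
ADDRESS_RE = re.compile(
    r"(?:[a-z0-9]+\.)*[a-z0-9]+#(?:[a-z0-9]+\.)+[a-z0-9]+",
    re.IGNORECASE,
)


def find_addresses(text):
    """Return every address-shaped substring found in text."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]


sample = "write to john.doe#mail.example.com or ........#....com"
print(find_addresses(sample))
```

Each word must now be followed by a literal dot inside the repeated group, so a run of letters can only be split in one way and the dots-only address is rejected.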

Regex for Google Analytics Goals

I've searched all the other Regex on Google Analytics questions but I can't use the answers as this is pretty specific to my problem.
I want to set a goal but use Regex to flag it as a goal IF string includes
/client-thank-you/ AND anything EXCEPT hire
so in other words
/client-thank-you/hire is not correct
/client-thank-you/anything/else is correct
Each of the following regexes will match any string that contains /client-thank-you/ and does not contain hire, depending on what assumption(s) you make about where "hire" is in the string.
Solution
Where can "hire" be located in the string?
Anywhere:
((?!hire).)*?/client-thank-you/((?!hire).)*
Only following the "/client-thank-you/":
.*?/client-thank-you/((?!hire).)*
Only immediately following the "/client-thank-you/":
.*?/client-thank-you/(?!hire).*
Notes
Optimization:
Each of these regexes will match the entire string. If your tool lets you determine if a string contains a substring match (rather than naively attempting to match the entire string), then you could optimize the second and third regexes by removing the leading .*?. Likewise, the third regex could be further optimized by removing the trailing .* as well.
Positively require "anything":
Note that all of these regexes assume that a string that ends with "/client-thank-you/" (with nothing after it) is valid. If this assumption is incorrect (i.e. the string .*/client-thank-you/$ is not a match), then change the trailing * on every regex to +. This would also mean that you have to keep the last .* on the third regex as a .+ (i.e. don't optimize that away).
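The difference between the three variants can be seen with a short Python sketch. Python's re module is used here only as a stand-in for an engine that supports lookahead; the dictionary keys, the function name is_goal, and the sample paths are made up for illustration.

```python
import re

PATTERNS = {
    "anywhere": r"((?!hire).)*?/client-thank-you/((?!hire).)*",
    "following": r".*?/client-thank-you/((?!hire).)*",
    "immediately_following": r".*?/client-thank-you/(?!hire).*",
}


def is_goal(variant, path):
    """True if the whole path matches the chosen variant."""
    return re.fullmatch(PATTERNS[variant], path) is not None


for path in ("/client-thank-you/hire",
             "/client-thank-you/anything/else",
             "/client-thank-you/we/hire",
             "/hire/client-thank-you/done"):
    print(path, {name: is_goal(name, path) for name in PATTERNS})
```

The last two sample paths are where the variants disagree: one has "hire" later in the path, the other has it before "/client-thank-you/".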
EDIT:
The above will not work since GA uses a very limited version of regex (that does not include lookaround). If there is no other GA tool (other than a single regex) that you can use that meets your needs, then you could use the following as a last-ditch effort:
([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]|.{1,3}$)
And in expanded form for illustration purposes only:
(
[-._~!$&'()*+,;=:#/0-9A-Za-gi-z]
|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]
|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]
|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]
|.{1,3}$
)
This regex will match 1-4 characters that do not form "hire". It does so by matching the minimum number of characters necessary to verify that the match is neither "hire" nor can serve as a prefix of "hire". It takes into account end-of-line (e.g. "hir" is valid if there is nothing else after it). The characters that it matches are all valid characters that can occur in the path component of a URL as specified in RFC 3986.
You use this regex by substituting it for every ((?!hire).) in any of the solutions given above. For example:
.*?/client-thank-you/([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]|.{1,3}$).*
This matches any URL that contains "/client-thank-you/" but not "/client-thank-you/hire".
Do be careful, though. Doubled "h"s will make this workaround fail (e.g. "hhire"). However, if "hire" will only ever follow a path delimiter (i.e. /hire/), then that shouldn't be a problem.
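The lookaround-free workaround and its doubled-"h" caveat can be tried out with the Python sketch below. Python's re module is only a stand-in here, not GA's engine, and the names BASE, NOT_HIRE, GOAL_RE, REPEATED_RE and is_goal are made up for illustration. The alternation is assembled from pieces so that each branch is visible; the result is the same string as the single-line regex above.

```python
import re

# Shared part of every character class: the allowed path characters
# other than lowercase letters.
BASE = r"-._~!$&'()*+,;=:#/0-9A-Z"

NOT_HIRE = "(" + "|".join([
    "[" + BASE + "a-gi-z]",      # any allowed character except "h"
    "h[" + BASE + "a-hj-z]",     # "h" not followed by "i"
    "hi[" + BASE + "a-qs-z]",    # "hi" not followed by "r"
    "hir[" + BASE + "a-df-z]",   # "hir" not followed by "e"
    ".{1,3}$",                   # one to three characters, then end of string
]) + ")"

# Replacement for ".*?/client-thank-you/(?!hire).*"
GOAL_RE = re.compile(".*?/client-thank-you/" + NOT_HIRE + ".*")

# Replacement for ".*?/client-thank-you/((?!hire).)*"
REPEATED_RE = re.compile(".*?/client-thank-you/" + NOT_HIRE + "*")


def is_goal(path):
    """True if the whole path matches the single-substitution regex."""
    return GOAL_RE.fullmatch(path) is not None


for path in ("/client-thank-you/hire", "/client-thank-you/hir",
             "/client-thank-you/anything/else"):
    print(path, is_goal(path))

# The doubled-"h" caveat: "hh" is consumed as one unit, so the "hire"
# that starts at the second "h" goes unnoticed by the repeated form.
print(REPEATED_RE.fullmatch("/client-thank-you/a/hire") is not None)
print(REPEATED_RE.fullmatch("/client-thank-you/hhire") is not None)
```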
If you can't use a lookahead like Travis suggested, then I suggest setting the goal to fire on an event instead of a pageview.
If you're using Google Tag Manager, you'll have the ability to write a more advanced regex, or at least set a blocking rule for the event that prevents it from firing when 'hire' is in the page URL.

Mistaken Squid Proxy regex? → ^.*stackoverflow\.*

I have several proxy rule files for Squid, and all contain rules like:
acl blacklisted dstdom_regex ^.*facebook\.* ^.*youtube\.* ^.*games.yahoo.com\.*
The patterns match against the domain name: dstdom_regex means destination (server) regular expression pattern matching.
The objective is to block some websites, but I don't know by what method: domain name, keywords in the domain name, ...
Let's expand/describe the pattern:
^.*stackexchange\.* The whole pattern
^ String beginning
.* Match anything (greedy quantifier, I presume)
stackexchange Keyword to match
\.* Any number of dots (.)
Totally legitimate matches:
stackexchange.com: The Stack Exchange website.
stackoverflow.stackexchange: The imaginary Stack Exchange gTLD.
But these possible matches make it seem more like a keyword block:
stackexchange
stackexchanger
notstackexchange
not-stackexchange
some-website.stackexchange
some-website.stackexchange-tld
And the pattern seems to contain a bug, since it allows the following invalid cases to match, thanks to the \.* at the end, although they never naturally occur:
stackexchange.
stackexchange...
stackexchange..........
stackexchange.......com
stackexchange.com
stackexchangecom
You get the idea.
Anything containing stackexchange, even if separated by dots from everything else, is still a valid match.
So now, the question itself:
This all means that this is simply a match for stackexchange! (I'm assuming the original author didn't intend to match infinite dots.)
So why not just use the pattern stackexchange? Wouldn't it be faster and give the same results, except for the "bug" (\.*)?
I.e., isn't ^.*stackexchange equivalent to stackexchange?
Edit: Just to clarify, I didn't write those proxy rule files.
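The equivalence asked about above can be checked with a short Python sketch. It assumes the pattern is applied as an unanchored search and uses Python's re module as a stand-in; Squid's own regex engine may differ in details. The host list is taken from the examples above, plus example.com as a negative case.

```python
import re

ORIGINAL = re.compile(r"^.*stackexchange\.*")
KEYWORD = re.compile(r"stackexchange")

HOSTS = [
    "stackexchange.com", "stackoverflow.stackexchange", "stackexchange",
    "stackexchanger", "notstackexchange", "not-stackexchange",
    "some-website.stackexchange", "some-website.stackexchange-tld",
    "stackexchange.", "stackexchange.......com", "stackexchangecom",
    "example.com",
]


def verdicts(host):
    """Return (matched by original pattern, matched by bare keyword)."""
    return (ORIGINAL.search(host) is not None,
            KEYWORD.search(host) is not None)


for host in HOSTS:
    original, keyword = verdicts(host)
    print(f"{host:32} original={original} keyword={keyword}")
```

Under search semantics the leading ^.* and the trailing \.* can both match nothing or anything, so the two patterns accept exactly the same hosts; only the matched span differs.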
I don't understand why you would use \.* to match all the trailing dots.
However, to work around the problem, you can try this:
^[^\.]*\.stackexchange\.*
[^\.]* matches anything except a dot
\. then you match the dot
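A quick way to see what this pattern accepts is the Python sketch below, again assuming unanchored search semantics and using Python's re module as a stand-in for Squid's engine. The function name blocked and the sample hosts are made up for illustration.

```python
import re

SUGGESTED = re.compile(r"^[^\.]*\.stackexchange\.*")


def blocked(host):
    """True if the suggested pattern matches the host."""
    return SUGGESTED.search(host) is not None


for host in ("www.stackexchange.com", "some-website.stackexchange",
             "stackexchange.com", "notstackexchange",
             "a.b.stackexchange.com"):
    print(f"{host:28} {blocked(host)}")
```

In this sketch the pattern requires exactly one dot-free label followed by a dot before stackexchange, so the bare stackexchange.com and deeper subdomains such as a.b.stackexchange.com are not matched. Whether that is acceptable depends on what the rule is meant to block.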

Regex captures all occurrences but the last of certain characters

I want to exclude common punctuation from my URL Regex detector when my clients type a sentence with a URL in it. A common scenario would be the URL example.com?q=this (which obviously needs to include the ?) versus a sentence saying
What do you think of example.com?
This expression suits my needs just fine:
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?
However it includes all punctuation at the end, so I am iterating through each match to find and use this captured group to exclude said punctuation:
(.*?)[?,!.;:]+$
However, I'm not sure how to leverage the "end of string" technique when scanning the entire block of text which may have multiple URLs. Was hoping there'd be a way to capture the right blocks from the get-go without the extra work.
Just require non-whitespace after the punctuation instead of making it optional.
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?
You will, of course, lose a valid trailing slash (example.com/ becomes example.com), but as far as I know there is no difference between the two.
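Here is a small Python sketch of the adjusted pattern scanning a block of text with several URLs. The function name find_urls and the sample sentence are made up for illustration.

```python
import re

# Same pattern as above: the trailing group only matches when at least one
# non-whitespace character follows the ?, # or /.
URL_RE = re.compile(r"(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?")


def find_urls(text):
    """Return all URL-like substrings in text."""
    return URL_RE.findall(text)


sample = ("What do you think of example.com? "
          "See example.com?q=this and https://sub.example.org/ too")
print(find_urls(sample))
```

Note that this only handles punctuation directly after the domain. Punctuation at the end of a query or path, as in "example.com?q=this.", is still captured, because \S+ consumes it.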

Regex to "ignore" not "exclude"

I'm totally lost. I need a regular expression that can detect any of the four URL prefixes below:
^(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)$
It should also detect:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
And, importantly, it should ignore, but NOT exclude, the following exact string (whether or not it is present in the page):
http://www.w3.org
This is complicated for me because I still need to include it in the regex even though it is ignored; otherwise it will match and be found by
(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)
My aim is to match any URL other than
http://www.w3.org
whether or not it is present in the page.
So if the page contains only
http://www.w3.org
and no other URL, then it shouldn't match.
Thanks, Tyler, but my regex knowledge is almost zero; I only know what commands do when I right-click on them to choose actions, as in Regulazy or RegExr.
So I updated my command according to the URL I provided to you:
href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom
and it works:
https?(://|%3A%2F%2F)(?!www.w3.org)(.*)
But because of my lack of knowledge, I don't understand how to do what is described below:
"What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs"
I tried to add this, but it doesn't work:
(www.)
All I'm missing now is detection of URLs starting with www:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
OK so try this:
/\bhttps?(:\/\/|%3A%2F%2F)(?!www\.w3\.org)(.*)\b/g
Test here: http://regexr.com?38jp5
That test link uses JavaScript-style regex, but it should work elsewhere.
The important part is the second half - a negative lookahead, that checks what follows is not the exact text www.w3.org
I compressed what you had: mine matches http then an optional s then either :// or %3A%2F%2F.
I wrapped the whole thing in word boundaries, you could change that to quotes or whatever you need. The global flag lets you match multiple items.
In regard to the OP's questions:
"D%22 could appear before http or https. This one is missing & should match: href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom"
If this matters, just remove the word boundary \b before and after the regex, so the http can match anywhere.
The regex command should detect: (any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
This regex would fail to match a link like http://google.com - looking for www is really not a good way to check for a link on its own. What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs
Edit #2:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
As I mentioned above, what you are describing will not match a URL like http://google.com, but if that is what you want, use this:
(\W|^)[wW]{3}\.\S+
Instead of that, what I think you want is this, which is a combination of my first answer, and the link to a different post above.
((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*
You'll want to use this regex with the global and case-insensitive flags.
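For a rough idea of how the combined regex behaves, here is a Python sketch. In Python the case-insensitive flag is re.IGNORECASE and the global flag corresponds to iterating over all matches; the function name find_urls and the sample text are made up for illustration.

```python
import re

# The combined pattern from above, unchanged.
URL_RE = re.compile(
    r"((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))"
    r"(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every matched URL; finditer stands in for the global flag."""
    return [m.group(0) for m in URL_RE.finditer(text)]


sample = "see http://www.w3.org and http://google.com or WWW.Example.com/page"
print(find_urls(sample))
print(find_urls("href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom"))
```

The negative lookahead is checked after each possible prefix (with or without www.), which is why http://www.w3.org is skipped while other URLs on the same page are still found.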