regex for several domains - regex

I am using a regular expression to determine when to fire a tracking tag or not.
If a visitor to one of the sites is on one of these three domains the tag should fire:
- www.grousemountainlodge.com
- www.glacierparkinc.com
- reserveglacierdenali.com
I actually have a regular expression that works. But I'm not confident and wanted to bounce it off the folk on this board.
This is what I have. Is there a simpler, more elegant or more robust regex to use for matching the 3 domains?
^(www\.)?((glacierparkinc|grousemountainlodge)\.com)$|(^reserveglacierdenali\.com)$
Following some answers, this regex should exlude other domains e.g. cats.glacierparkinc.com or similar.

I'm not sure whether glacierparkinc.com should match, without the www. prefix - from your list it seems that no, but from your regex it seems it will be matched.
In either case I guess you can simplify it a bit:
^(?:www\.(?:glacierparkinc|grousemountainlodge)|reserveglacierdenali)\.com$
Note the use of (?:) instead of just (): this means positive look-ahead assertion without capturing. Its a best practice not to capture when you don't need to - saving time and memory.

It must be at starting position with or not www.. So:
^(?:www\.)?(?:glacierparkinc|grousemountainlodge|reserveglacierdenali)\.
If it maches, then do something.
Regex live here.
Hope it helps.

Related

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com(edited)
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and # https://regex101.com/r/kS1nQ4/1 but it doesn't work.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround, the neative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example trivially matches always, because example is not subdomain. -- the negative lookahead can't not be true.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
Here's a solution that takes into account the protocols/prefixes,
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

RegEx match all website links except those containing admin

I'm setting up URL Rewrite on an IIS and i need to match the following URLs using regex.
http://sub.mysite.com
sub.mysite.com
sub.mysite.com/
sub.mysite.com/Site1
sub.mysite.com/Site1/admin
but not:
sub.mysite.com/admin
sub.mysite.com/admin/somethingelse
sub.mysite.com/admin/admin
The site it self (sub.mysite.com) should not be "hardcoded" in the expression. Instead, it should be matched by something like .*.
I'm really blank on this one. I did find solutions to match the different URLs but once i try to combine them either none of them match or all of them do.
I hope someone can help me.
For your specific case, assuming you are matching the part after the domain (REQUEST_URI):
(?!/admin).*
(?!...) is a negative lookahead. I am not sure if it is supported in the IIS URL Rewrite engine. If not, a better approach would be to check for a complementary approach:
Or as #kirilloid said, just match /admin/? and discard (pay attention to slashes).
BTW. if you want to quickly test RegExps with a "visual" feedback, I highly recommend http://gskinner.com/RegExr/
([A-Za-z0-9]+.)+.com(?!/admin)/?([A-Za-z0-9]+/?)*
this should do the trick

Regex for matching complete substring

Sorry for the dummy question but I'm a regular expressions newbie.
I want these matches:
MATCH! http://www.google.com/search?q=...
NO MATCH http://www.googledummy.com/search?q=...
MATCH! http://www.google.it/search?q=...
NO MATCH! http://www.google.it/
NO MATCH! http://www.google.it/foobar
MATCH! google.it/search?q=...
MATCH! google.xxxxx/search?q=...
Should my regex be something like this?
google.[*$]/search
You probably want something like this:
^(?:https?://)?(?:[^.\s]+\.)*google(\.\w+){1,2}/search\?q=
This regex allows:
^ - match from the start - do not allow partial matching of the domain.
(?:https?://)? - http or https protocol.
(?:[^.]+\.)* - sub domains, but not other characters: hello.google.com is OK.
google
Does not allow:
http://notgoogle.com/search?q=
http://example.com?google.com/search?q=
Problems:
(\.\w+){1,2} - allows google.co.il, but also google.hackers.com. It's problematic unless you want to white list all two-word tlds.
the q query parameter may not be the first one (though, maybe that is one of the requirements).
\w may not fit all characters that are valid in top level domains (though google is not likely to buy google.קום)
Example: http://rubular.com/r/Avd5RFs3oH
Conclusion - If at all applicable, use a URL parser :)
from what you wrote I would say
google\.[a-z]+\/search
whether you should use \/ or just / before search depends on the language you are using.
As SeRPRo this does not work for google.co.uk, to make it working with it you can use:
google\.[a-z]+(?:\.[a-z])?\/search
(is there any country that requires a third level?)
This one works:
google\.[a-zA-Z\.]+/(search\W.+)
Example
You might want the following:
google\.[a-zA-Z.]+/search
Both other answers should work fine until you meet a second-level google site like google.com.ua

Regex Problem (newbie)

i'm writing a little app for spam-checking and i'm having problems with a regex.
let's say i'm having this spam-url:
http://hosting.tyumen.ru/tip.html
so i want to check its url for having 2 full stops (subdomain+ending), a slash, a word, full stop and "html".
here's what i got so far:
(http://.*?\..*?..*?/.*?.html)
might look like rubbish but it works - the problem: it's really slow and freezing my app.
any hints on how to optimize it?
thx.re
The reason it's slow is that the non-greedy operators ? being used this way is prone to catastrophic backtracking
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible, that isn't a double quote, which would terminate the href ":
\1
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will be broken if, say, the a has a class name, an id, or anything else in its body. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
(http://[\w.-]+/.+?\.html) - may be will work for your case only.
or may be faster one
(http://[\w.-]+/[^.]+\.html)
Since you claim to be a regexp newbie, I will offer a more general advice on creating and debugging regular expressions. When they get pretty complicated, I find using Regexp Coach a must.
It's a freeware and really saves a lot of headache. Not to mention you don't have to build / run your application every minute just to see if the regexp works the way you wanted.
In Python, a simple way to match URLs ending in .html or .htm is to use
url_re = re.compile(
r'https?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|' #domain...
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:\S+.html?)+' # ending in .html
, re.IGNORECASE)
which is a modified version of Django's UrlField regex.
This will match any site ending with .html or .htm. (either localhost, ip, domain).
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#

Regex to extract part of a url

I'm being lazy tonight and don't want to figure this one out. I need a regex to match 'jeremy.miller' and 'scottgu' from the following inputs:
http://codebetter.com/blogs/jeremy.miller/archive/2009/08/26/talking-about-storyteller-and-executable-requirements-on-elegant-code.aspx
http://weblogs.asp.net/scottgu/archive/2009/08/25/clean-web-config-files-vs-2010-and-net-4-0-series.aspx
Ideas?
Edit
Chris Lutz did a great job of meeting the requirements above. What if these were the inputs so you couldn't use 'archive' in the regex?
http://codebetter.com/blogs/jeremy.miller/
http://weblogs.asp.net/scottgu/
Would this be what you're looking for?
'/([^/]+)/archive/'
Captures the piece before "archive" in both cases. Depending on regex flavor you'll need to escape the /s for it to work. As an alternative, if you don't want to match the archive part, you could use a lookahead, but I don't like lookaheads, and it's easier to match a lot and just capture the parts you need (in my opinion), so if you prefer to use a lookahead to verify that the next part is archive, you can write one yourself.
EDIT: As you update your question, my idea of what you want is becoming fuzzier. If you want a new regex to match the second cases, you can just pluck the appropriate part off the end, with the same / conditions as before:
'/([^/]+)/$'
If you specifically want either the text jeremy.miller or scottgu, regardless of where they occur in a URL, but only as "words" in the URL (i.e. not scottgu2), try this, once again with the / caveat:
'/(jeremy\.miller|scottgu)/'
As yet a third alternative, if you want the field after the domain name, unless that field is "blogs", it's going to get hairy, especially with the / caveat:
'http://[^/]+/(?:blogs/)?([^/]+)/'
This will match the domain name, an optional blogs field, and then the desired field. The (?:) syntax is a non-capturing group, which means it's just like regular parenthesis, but won't capture the value, so the only value captured is the value you want. (?:) has a risk of varying depending on your particular regex flavor. I don't know what language you're asking for, but I predominantly use Perl, so this regex should pretty much do it if you're using PCRE. If you're using something different, look into non-capturing groups.
Wow. That's a lot of talking about regexes. I need to shut up and post already.
Try this one:
/\/([\w\.]+)\/archive/