Regex for matching complete substring - regex

Sorry for the dummy question but I'm a regular expressions newbie.
I want these matches:
MATCH! http://www.google.com/search?q=...
NO MATCH http://www.googledummy.com/search?q=...
MATCH! http://www.google.it/search?q=...
NO MATCH! http://www.google.it/
NO MATCH! http://www.google.it/foobar
MATCH! google.it/search?q=...
MATCH! google.xxxxx/search?q=...
Should my regex be something like this?
google.[*$]/search

You probably want something like this:
^(?:https?://)?(?:[^.\s]+\.)*google(\.\w+){1,2}/search\?q=
This regex allows:
^ - match from the start - do not allow partial matching of the domain.
(?:https?://)? - http or https protocol.
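(?:[^.\s]+\.)* - subdomains, but not other characters: hello.google.com is OK.
google - the literal domain name.
(\.\w+){1,2} - one or two dot-separated suffixes, such as .com or .co.il.
/search\?q= - the search path, followed by q as the first query parameter.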
Does not allow:
http://notgoogle.com/search?q=
http://example.com?google.com/search?q=
Problems:
(\.\w+){1,2} - allows google.co.il, but also google.hackers.com. It's problematic unless you want to white list all two-word tlds.
the q query parameter may not be the first one (though, maybe that is one of the requirements).
\w may not fit all characters that are valid in top level domains (though google is not likely to buy google.קום)
Example: http://rubular.com/r/Avd5RFs3oH
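If you would rather check this locally than on Rubular, here is a minimal Python sketch that runs the regex above against the URLs from the question and the two "does not allow" cases. The helper name is_google_search is made up for this example.

```python
import re

# Pattern copied from the answer above. The ^ anchor prevents partial domain matches.
PATTERN = re.compile(r"^(?:https?://)?(?:[^.\s]+\.)*google(\.\w+){1,2}/search\?q=")

def is_google_search(url: str) -> bool:
    """Hypothetical helper: True when the URL matches the pattern."""
    return PATTERN.search(url) is not None

samples = [
    "http://www.google.com/search?q=...",
    "http://www.googledummy.com/search?q=...",
    "http://www.google.it/search?q=...",
    "http://www.google.it/",
    "http://www.google.it/foobar",
    "google.it/search?q=...",
    "google.xxxxx/search?q=...",
    "http://notgoogle.com/search?q=",
    "http://example.com?google.com/search?q=",
]

for url in samples:
    print("MATCH   " if is_google_search(url) else "NO MATCH", url)
```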
Conclusion - If at all applicable, use a URL parser :)
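As a rough illustration of the URL parser route, here is a sketch using Python's urllib.parse. The function name is_google_search_url and the rule "google must be a whole host label with at least one label after it" are assumptions made for this example, not requirements from the question. Like the regex, it still accepts hosts such as google.hackers.com unless you add an allow list of suffixes.

```python
from urllib.parse import urlsplit, parse_qs

def is_google_search_url(url: str) -> bool:
    """Hypothetical helper: host contains a 'google' label, path is /search, q is present."""
    # urlsplit only recognises the host when a scheme or "//" precedes it,
    # so bare inputs like "google.it/search?q=x" get a "//" prefix first.
    if not url.startswith(("http://", "https://", "//")):
        url = "//" + url
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Assumption: "google" must be a complete label and must not be the last one.
    has_google_label = "google" in labels[:-1]
    # parse_qs finds q wherever it appears in the query string, not only first.
    query = parse_qs(parts.query, keep_blank_values=True)
    return has_google_label and parts.path == "/search" and "q" in query

for url in [
    "http://www.google.com/search?q=...",
    "http://www.googledummy.com/search?q=...",
    "google.it/search?q=...",
    "http://www.google.com/search?hl=en&q=test",
    "http://example.com?google.com/search?q=",
]:
    print(is_google_search_url(url), url)
```

Because the query string is parsed separately, the position of the q parameter no longer matters, which was one of the problems listed for the regex.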

From what you wrote, I would say:
google\.[a-z]+\/search
Whether you should use \/ or just / before search depends on the language you are using.
As SeRPRo pointed out, this does not work for google.co.uk. To make it work, you can use:
google\.[a-z]+(?:\.[a-z]+)?\/search
(is there any country that requires a third level?)
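A quick Python check of the two-level variant, as a sketch. It assumes the optional second label is written as [a-z]+ (one or more letters), so that multi-letter labels such as uk fit. Python needs no escaping for /. Note that the pattern is unanchored, so it would also match inside a longer host name such as notgoogle.com.

```python
import re

# Assumption: the optional second label allows one or more letters.
PATTERN = re.compile(r"google\.[a-z]+(?:\.[a-z]+)?/search")

urls = [
    "http://www.google.com/search?q=x",
    "google.co.uk/search?q=x",
    "http://www.google.it/foobar",
    "http://www.googledummy.com/search?q=x",
]

for url in urls:
    print(bool(PATTERN.search(url)), url)
```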

This one works:
google\.[a-zA-Z\.]+/(search\W.+)

You might want the following:
google\.[a-zA-Z.]+/search
Both of the other answers should work fine until you meet a Google site under a second-level domain, like google.com.ua.

Related

Regex: how can exclude all TLD except my own domain

I have an Asp.Net website on which I'm implementing some basic anti-spam stuff via the validation controls.
One such regex is: "^(?!.*(//|[.]({com|net|info|uk|etc}))).*$"
It pretty much does what it needs to as far as blocking goes — it doesn't need to be too sophisticated. However, I want to include the option to whitelist my own domain.
So, I want to block all .uk domains, except mydomain.co.uk.
This is a regex step beyond me — can anyone help?
You may nest a negative lookbehind in front of uk inside the existing negative lookahead, so that the uk alternative fails whenever it is preceded by mydomain.co.:
^(?!.*(//|[.](com|net|info|(?<!mydomain\.co\.)uk))).*$
Take note of (?<!mydomain\.co\.), a negative lookbehind that prevents uk from matching when it is preceded by mydomain.co. (including the trailing dot).
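The question is about ASP.NET validators, but the same pattern can be tried in Python, whose re module also supports fixed-width lookbehinds. This is only a sketch; the helper name is_allowed and the sample strings are made up. One caveat visible from the pattern itself: the lookbehind does not check what comes before mydomain, so notmydomain.co.uk would pass as well.

```python
import re

# Pattern from the answer: block "//" and the listed TLDs, except uk after "mydomain.co."
BLOCK = re.compile(r"^(?!.*(//|[.](com|net|info|(?<!mydomain\.co\.)uk))).*$")

def is_allowed(text: str) -> bool:
    """Hypothetical helper: True when the validator pattern accepts the text."""
    return BLOCK.match(text) is not None

for text in [
    "hello world",
    "visit mydomain.co.uk today",
    "visit spam.co.uk today",
    "see example.com",
    "http://anything",
]:
    print(is_allowed(text), repr(text))
```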

regular expressions: catch any URLs of the domain example.com

I'm trying to write a regexp for the case below. I have made multiple attempts, but in vain.
I need to catch any URLs of the domain site.com. I tried the regexp ^site.com/*$
but it does not recognize them.
I'm just looking for a regexp which matches site.com/*
With your expression ^site.com/*$ you match only strings that consist of site.com followed by zero or more trailing / characters (/*).
If you want to match any string starting with site.com/, you might want to try ^site\.com/.*$.
There are already a lot of other regex questions regarding domain names on SO, but your question is not clear to me in what context you are trying to do this, or what is the actual goal you want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), so it should be escaped. If you want to capture site.com/foo, use a pattern that does not limit the number of characters after the slash. I'd do this with groups:
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
Your regex ^site.com/*$ only matches strings like the following:
ex) site.com/ site.com//////// site.com
because the * asterisk in regex means "match 0 or more of the preceding token".
So this should work:
^site\.com\/.*$
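To see the difference between the attempts, here is a small Python sketch. The names LOOSE, STRICT and GROUPED and the sample strings are made up for the example. It also shows that the unescaped dot in the original attempt accepts any character in that position.

```python
import re

LOOSE = re.compile(r"^site.com/*$")          # the attempt from the question
STRICT = re.compile(r"^site\.com/.*$")       # escaped dot, then anything after the slash
GROUPED = re.compile(r"^(site\.com/)(.+)$")  # same idea, with the path captured in group 2

for s in ["site.com", "site.com///", "site.com/foo", "siteXcom"]:
    print(f"{s!r}: loose={bool(LOOSE.match(s))} strict={bool(STRICT.match(s))}")

m = GROUPED.match("site.com/foo")
print(m.group(2) if m else None)
```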

Regex on domain and negation against language subfolders

Let's say my domains are:
www.test.com
www.test.com/en-gb
www.test.com/cn-cn
These are language sites, the first is the main US English site. In Google Analytics I want to set up a filter to only show me traffic of the first (US) domain. I could do this, I think:
^\/(en-gb|cn-cn).*$
If I EXCLUDE my Request URI with that filter pattern, then I should get a view for the en-US domain. However, I'm interested in understanding regex better so here is some test data and code which I am trying out on http://www.regextester.com/
Regular expression:
^\/(en-gb|cn-cn).*$
Test String
/cn-cn/about
/cn-cn/about/
/cn-cn
/cn-cn/about/test
/en-gb/
/en-gb
/en-gb-test/
/en-gb/aboutus/
/en-gb?q=1
/en-gb/?q=1
/about-us
/test?q=1
/aword/me/
/three
/about/en-gb/
/about/en-gb-test/
/test-yes/
/test/me/
/hello/world/
My questions:
If you try this out, you'll notice that /en-gb-test/ is actually matched with the Regex. How do I avoid this?
Also, let's say I wanted to have a rule to NEGATE this whole option. So rather than telling Google Analytics to "exclude", I am curious how I could write the opposite of this same rule. So basically, catch all URLs that are not in /en-gb and /cn-cn sub-folders.
Thanks in advance!
You may stop the regex from matching en-gb-test by requiring a / or ? after the language code, or the end of the string:
^\/(en-gb|cn-cn)([\/?]|$)
See the regex demo. If you really need to match the rest of the string too, add .* after [\/?]: ^\/(en-gb|cn-cn)([\/?].*|$).
Details:
^ - start of string
\/ - a / (note that you do not need to escape / in GA regex)
(en-gb|cn-cn) - a capturing group with 2 alternatives, either en-gb or cn-cn
([\/?]|$) - a capturing group with two alternatives: a ? or / OR the end of the string.
RE2 regex does not support lookaheads, which are what you need to match everything except a given pattern. The negated rule would look like ^(?!\/(en-gb|cn-cn)([\/?]|$)).*, but it is not possible with RE2.
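Python's re module does support lookaheads, so both patterns can at least be tried outside Google Analytics. The sketch below only illustrates what the two rules would match on some of the test strings from the question. The names INCLUDE and EXCLUDE are made up, and nothing here changes what RE2 accepts.

```python
import re

INCLUDE = re.compile(r"^/(en-gb|cn-cn)([/?]|$)")        # the language subfolders
EXCLUDE = re.compile(r"^(?!/(en-gb|cn-cn)([/?]|$)).*")  # everything else (needs lookahead)

paths = [
    "/cn-cn/about",
    "/en-gb",
    "/en-gb?q=1",
    "/en-gb-test/",
    "/about/en-gb/",
    "/about-us",
]

for p in paths:
    print(f"{p:16} include={bool(INCLUDE.search(p))} exclude={bool(EXCLUDE.match(p))}")
```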

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com.
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and at https://regex101.com/r/kS1nQ4/1 but it doesn't work.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround, the negative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example trivially always matches, because the text example never starts with subdomain. (with the dot); the negative lookahead can never fail there.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
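The question is about Notepad++, but the lookaround behaviour described above is the same in Python's re module, so it can be checked there. A minimal sketch; the SQL-like sample line and the names ANY_SUB and ONE_SUB are made up for the example.

```python
import re

ANY_SUB = re.compile(r"(?<!\.)example\.com")           # reject any dotted prefix
ONE_SUB = re.compile(r"(?<!subdomain\.)example\.com")  # reject only "subdomain."

text = (
    "INSERT INTO t VALUES ('http://example.com/a', "
    "'http://subdomain.example.com/b', 'http://www.example.com/c');"
)

print(len(ANY_SUB.findall(text)))  # occurrences with no dot directly before
print(len(ONE_SUB.findall(text)))  # occurrences not directly preceded by "subdomain."

# The lookahead version cannot fail in front of the literal text "example":
print(bool(re.search(r"(?!subdomain\.)example", "subdomain.example.com")))
```

Which of the two lookbehinds you want depends on whether www.example.com should count as a hit.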
Here's a solution that takes into account the protocols/prefixes,
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

regex for several domains

I am using a regular expression to determine when to fire a tracking tag or not.
If a visitor to one of the sites is on one of these three domains the tag should fire:
- www.grousemountainlodge.com
- www.glacierparkinc.com
- reserveglacierdenali.com
I actually have a regular expression that works. But I'm not confident and wanted to bounce it off the folk on this board.
This is what I have. Is there a simpler, more elegant or more robust regex to use for matching the 3 domains?
^(www\.)?((glacierparkinc|grousemountainlodge)\.com)$|(^reserveglacierdenali\.com)$
Following some answers: this regex should exclude other domains, e.g. cats.glacierparkinc.com or similar.
I'm not sure whether glacierparkinc.com should match, without the www. prefix - from your list it seems that no, but from your regex it seems it will be matched.
In either case I guess you can simplify it a bit:
^(?:www\.(?:glacierparkinc|grousemountainlodge)|reserveglacierdenali)\.com$
Note the use of (?:) instead of just (): this is a non-capturing group (not a look-ahead assertion), so it groups the alternatives without capturing them. It's a best practice not to capture when you don't need to, since it saves time and memory.
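Here is a small Python sketch of the simplified pattern above, under that answer's reading of the list (www. required for the first two domains and not allowed for the third). The name HOSTS and the sample host names other than the three from the question are made up.

```python
import re

HOSTS = re.compile(
    r"^(?:www\.(?:glacierparkinc|grousemountainlodge)|reserveglacierdenali)\.com$"
)

for host in [
    "www.grousemountainlodge.com",
    "www.glacierparkinc.com",
    "reserveglacierdenali.com",
    "cats.glacierparkinc.com",
    "glacierparkinc.com",
]:
    print(bool(HOSTS.match(host)), host)

# Non-capturing groups leave nothing behind in the match object.
print(HOSTS.match("www.glacierparkinc.com").groups())
```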
The match must start at the beginning of the string, with or without www.. So:
^(?:www\.)?(?:glacierparkinc|grousemountainlodge|reserveglacierdenali)\.
If it matches, then do something.
Hope it helps.