Regex on domain and negation against language subfolders - regex

Let's say my domains are:
www.test.com
www.test.com/en-gb
www.test.com/cn-cn
These are language sites, the first is the main US English site. In Google Analytics I want to set up a filter to only show me traffic of the first (US) domain. I could do this, I think:
^\/(en-gb|cn-cn).*$
If I EXCLUDE my Request URI with that filter pattern, then I should get a view for the en-US domain. However, I'm interested in understanding regex better so here is some test data and code which I am trying out on http://www.regextester.com/
Regular expression:
^\/(en-gb|cn-cn).*$
Test String
/cn-cn/about
/cn-cn/about/
/cn-cn
/cn-cn/about/test
/en-gb/
/en-gb
/en-gb-test/
/en-gb/aboutus/
/en-gb?q=1
/en-gb/?q=1
/about-us
/test?q=1
/aword/me/
/three
/about/en-gb/
/about/en-gb-test/
/test-yes/
/test/me/
/hello/world/
My questions:
If you try this out, you'll notice that /en-gb-test/ is actually matched with the Regex. How do I avoid this?
Also, let's say I wanted to have a rule to NEGATE this whole option. So rather than telling Google Analytics to "exclude", I am curious how I could write the opposite of this same rule. So basically, catch all URLs that are not in /en-gb and /cn-cn sub-folders.
Thanks in advance!

You can stop the regex from matching en-gb-test by requiring that the language code is followed by / or ?, or by the end of the string:
^\/(en-gb|cn-cn)([\/?]|$)
See the regex demo. If you really need to get the rest of the string, add .* after [\/?]: ^\/(en-gb|cn-cn)([\/?].*|$).
Details:
^ - start of string
\/ - a / (note that you do not need to escape / in GA regex)
(en-gb|cn-cn) - a capturing group with 2 alternatives, either en-gb or cn-cn
([\/?]|$) - a capturing group with two alternatives: a ? or / OR the end of the string.
RE2 regex does not support lookaheads, which are what you would normally use to match anything that is not a given pattern. The negated rule would look like ^(?!\/(en-gb|cn-cn)([\/?]|$)).*, but that is not possible with RE2.
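To see both points in action outside Google Analytics, here is a minimal sketch in Python. Python's re module is not RE2, so this only approximates how GA evaluates the pattern, and the function names are hypothetical. The negation is done in code rather than in the regex, which is the same idea as keeping the pattern and choosing "exclude" in the filter.

```python
import re

# Pattern from the answer: the language code must be followed by "/", "?"
# or the end of the string ("/" needs no escaping here).
LANG_PREFIX = re.compile(r"^/(en-gb|cn-cn)([/?]|$)")

def is_language_subfolder(request_uri: str) -> bool:
    """True for URIs inside the /en-gb or /cn-cn sub-folders."""
    return LANG_PREFIX.search(request_uri) is not None

def is_main_site(request_uri: str) -> bool:
    """Negation done outside the regex, so no lookahead is needed."""
    return not is_language_subfolder(request_uri)

if __name__ == "__main__":
    for uri in ["/en-gb", "/en-gb/?q=1", "/en-gb-test/", "/about/en-gb/", "/cn-cn/about"]:
        print(uri, is_language_subfolder(uri), is_main_site(uri))
```

With the boundary group in place, /en-gb-test/ is no longer treated as a language sub-folder.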

Related

Regex: how can exclude all TLD except my own domain

I have an Asp.Net website on which I'm implementing some basic anti-spam stuff via the validation controls.
One such regex is: "^(?!.*(//|[.]({com|net|info|uk|etc}))).*$"
It pretty much does what it needs to as far as blocking goes — it doesn't need to be too sophisticated. However, I want to include the option to whitelist my own domain.
So, I want to block all .uk domains, except mydomain.co.uk.
This is a regex step beyond me — can anyone help?
You may put a nested negative lookbehind in front of uk, so that the existing negative lookahead does not reject uk when it is preceded by mydomain.co.:
^(?!.*(//|[.](com|net|info|(?<!mydomain\.co\.)uk))).*$
Take note of (?<!mydomain\.co\.), which is a negative lookbehind that prevents uk from matching if it is preceded by mydomain.co.
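A minimal sketch of how this pattern behaves, written in Python rather than .NET (both engines support this fixed-width lookbehind). The name passes_validation and the sample strings are assumptions for illustration only.

```python
import re

# Pattern from the answer: block "//" and the listed TLDs, except that "uk"
# is not blocked when the text right before it is "mydomain.co.".
BLOCK_SPAM = re.compile(
    r"^(?!.*(//|[.](com|net|info|(?<!mydomain\.co\.)uk))).*$"
)

def passes_validation(text: str) -> bool:
    """True when the text contains none of the blocked fragments."""
    return BLOCK_SPAM.search(text) is not None

if __name__ == "__main__":
    for sample in ["hello world", "see mydomain.co.uk", "see spam.co.uk", "visit spam.com"]:
        print(repr(sample), passes_validation(sample))
```

One caveat: the lookbehind only inspects the characters directly before uk, so a host such as notmydomain.co.uk is whitelisted as well.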

regular expressions: catch any URLs of the domain example.com

I'm trying to get regexp code for the case below. I have made multiple attempts, but in vain.
I need to catch any URLs of the domain site.com. I tried using the regexp ^site.com/*$
but it does not recognize them.
I'm just looking for regexp code which matches site.com/*
With your expression ^site.com/*$ you match only strings that consist of site.com (where the unescaped . matches any character) followed by zero or more trailing / characters (/*).
If you want to match any string starting with site.com/, you might want to try ^site\.com/.*$.
There are already a lot of other regex questions regarding domain names on SO, but your question is not clear to me in what context you are trying to do this, or what is the actual goal you want to achieve. If you describe your needs more precisely you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note, . has a special meaning in regex meaning any character, and if you wanted to capture site.com/foo you might want to use something where you are not limited to the number of characters by the end. I'd do this with groupings.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
Your regex ^site.com/*$ only matches strings like the following:
ex) site.com/ site.com//////// site.com
because the * (asterisk) in regex means "match 0 or more of the preceding token".
So this should work:
^site.com\/.*$
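The difference between the two patterns discussed in these answers can be checked with a short Python sketch. The sample strings are made up for illustration.

```python
import re

# The asker's pattern: "/*" means zero or more slashes, and "." matches any character.
original = re.compile(r"^site.com/*$")
# The suggested pattern: a literal dot, one slash, then anything.
suggested = re.compile(r"^site\.com/.*$")

samples = ["site.com", "site.com/", "site.com////", "site.com/foo", "siteXcom"]

if __name__ == "__main__":
    for s in samples:
        print(f"{s!r}: original={bool(original.search(s))}, "
              f"suggested={bool(suggested.search(s))}")
```

Note that the suggested pattern requires the slash, so a bare site.com is not matched by it.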

How do I escape a certain word in a URL with Google Analytics regex?

I'm trying to set up a regex to escape a certain word in a URL with Google Analytics goals.
Currently the URL path is:
/pricing/mls
/pricing/armls
/pricing/armls/bundle/5
Step 1 is static, and will always stay that way but step 2 has over 80 different possibilities. I wanted to set up a Regex that will specifically escape "mls". Using the (.*) would also grab the mls page which I'm trying to escape. Currently my regex looks like this:
^\/pricing\/mls$
^\/pricing\/(.*) this is where I'm trying to escape the mls portion
^\/pricing\/(.*)\/bundle\/(5|6|7)
I tried (?!mls) but Google Analytics doesn't support negative look aheads. Any help would be much appreciated, thanks everyone!
Without a negative lookaround, this is going to get messy.
The only clean way I see is to use an optimized whitelist of alternations like this:
^\/pricing\/((?:[ac]r|t)mls|mlspin|realcomp)\/bundle\/(5|6|7)
Tip: I used myregextester.com to get the inner part optimized (just enter your pattern, tick the OPTIMIZE checkbox, and submit).
[*]: Side note: Google Analytics doesn't support single-line and multiline modes since URLs can't contain newlines. So there should never be any need for ^ and $ to match anywhere but the beginning and end of the whole string.
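A quick way to sanity-check the whitelist is to run it against a few paths. This is a sketch in Python, not in Google Analytics itself, and the sample paths other than the ones from the question are derived from the alternation in the answer.

```python
import re

# Whitelist from the answer: (?:[ac]r|t)mls expands to armls, crmls and tmls.
# A plain "mls" is not in the list, so step 1 is never matched here.
BUNDLE_STEP = re.compile(
    r"^/pricing/((?:[ac]r|t)mls|mlspin|realcomp)/bundle/(5|6|7)"
)

def is_bundle_step(path: str) -> bool:
    return BUNDLE_STEP.search(path) is not None

if __name__ == "__main__":
    for path in ["/pricing/armls/bundle/5", "/pricing/mls/bundle/5", "/pricing/mls"]:
        print(path, is_bundle_step(path))
```

The trade-off of a whitelist is that every new second-step value has to be added to the alternation by hand.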

RegEx match all website links except those containing admin

I'm setting up URL Rewrite on an IIS and i need to match the following URLs using regex.
http://sub.mysite.com
sub.mysite.com
sub.mysite.com/
sub.mysite.com/Site1
sub.mysite.com/Site1/admin
but not:
sub.mysite.com/admin
sub.mysite.com/admin/somethingelse
sub.mysite.com/admin/admin
The site itself (sub.mysite.com) should not be "hardcoded" in the expression. Instead, it should be matched by something like .*.
I'm really blank on this one. I did find solutions to match the different URLs but once i try to combine them either none of them match or all of them do.
I hope someone can help me.
For your specific case, assuming you are matching the part after the domain (REQUEST_URI):
(?!/admin).*
(?!...) is a negative lookahead. I am not sure whether it is supported in the IIS URL Rewrite engine. If not, a better option would be a complementary approach:
Or, as @kirilloid said, just match /admin/? and discard those URLs (pay attention to slashes).
BTW. if you want to quickly test RegExps with a "visual" feedback, I highly recommend http://gskinner.com/RegExr/
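The two approaches can be compared with a small Python sketch. It assumes the input is only the path after the domain. The leading ^ and the (/|$) boundary are additions for this illustration and are not part of the answer's patterns; how IIS URL Rewrite anchors its patterns is not covered here.

```python
import re

# Lookahead approach. Without the added "^", an unanchored search could
# simply start matching one character later and accept "/admin" anyway.
NOT_ADMIN = re.compile(r"^(?!/admin).*")

# Complementary approach: match the admin paths and discard them.
IS_ADMIN = re.compile(r"^/admin(/|$)")

def allowed_by_lookahead(path: str) -> bool:
    return NOT_ADMIN.match(path) is not None

def allowed_by_complement(path: str) -> bool:
    return IS_ADMIN.match(path) is None

if __name__ == "__main__":
    for path in ["", "/", "/Site1", "/Site1/admin", "/admin", "/admin/admin"]:
        print(repr(path), allowed_by_lookahead(path), allowed_by_complement(path))
```

The two variants differ on a path such as /administrator: the lookahead rejects anything that starts with /admin, while the complementary pattern only rejects /admin itself and paths below it.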
([A-Za-z0-9]+.)+.com(?!/admin)/?([A-Za-z0-9]+/?)*
This should do the trick.

Regex for matching complete substring

Sorry for the dummy question but I'm a regular expressions newbie.
I want these matches:
MATCH! http://www.google.com/search?q=...
NO MATCH http://www.googledummy.com/search?q=...
MATCH! http://www.google.it/search?q=...
NO MATCH! http://www.google.it/
NO MATCH! http://www.google.it/foobar
MATCH! google.it/search?q=...
MATCH! google.xxxxx/search?q=...
Should my regex be something like this?
google.[*$]/search
You probably want something like this:
^(?:https?://)?(?:[^.\s]+\.)*google(\.\w+){1,2}/search\?q=
This regex allows:
^ - match from the start - do not allow partial matching of the domain.
(?:https?://)? - http or https protocol.
(?:[^.\s]+\.)* - subdomains, but not other characters: hello.google.com is OK.
google
Does not allow:
http://notgoogle.com/search?q=
http://example.com?google.com/search?q=
Problems:
(\.\w+){1,2} - allows google.co.il, but also google.hackers.com. It's problematic unless you want to white list all two-word tlds.
the q query parameter may not be the first one (though, maybe that is one of the requirements).
\w may not fit all characters that are valid in top level domains (though google is not likely to buy google.קום)
Example: http://rubular.com/r/Avd5RFs3oH
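The behaviour described above, including the google.hackers.com problem, can be reproduced with a short Python sketch. The function name and the q=x sample values are assumptions for illustration.

```python
import re

# Pattern quoted from the answer above.
GOOGLE_SEARCH = re.compile(
    r"^(?:https?://)?(?:[^.\s]+\.)*google(\.\w+){1,2}/search\?q="
)

def looks_like_google_search(url: str) -> bool:
    return GOOGLE_SEARCH.search(url) is not None

if __name__ == "__main__":
    for url in [
        "http://www.google.com/search?q=x",
        "http://www.googledummy.com/search?q=x",
        "google.it/search?q=x",
        "http://example.com?google.com/search?q=",
        "google.hackers.com/search?q=x",  # accepted: the known weakness
    ]:
        print(url, looks_like_google_search(url))
```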
Conclusion - If at all applicable, use a URL parser :)
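As one possible reading of that advice, here is a sketch using Python's standard urllib.parse. The function name is hypothetical, and the rule "google must be a whole host label that is not the last one" is an assumption; it still accepts hosts like google.hackers.com, just as the regex does.

```python
from urllib.parse import urlsplit, parse_qs

def is_google_search(url: str) -> bool:
    # urlsplit only recognises the host when a scheme is present, so add a
    # placeholder scheme for inputs like "google.it/search?q=x".
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Assumption: "google" must be a complete label followed by at least one more.
    has_google_label = "google" in labels[:-1]
    # keep_blank_values=True so that an empty "q=" still counts as present.
    query = parse_qs(parts.query, keep_blank_values=True)
    return has_google_label and parts.path == "/search" and "q" in query

if __name__ == "__main__":
    for url in ["http://www.google.com/search?q=x", "http://www.google.it/foobar"]:
        print(url, is_google_search(url))
```

Unlike the regex, this version does not care where q appears among the query parameters.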
From what you wrote, I would say:
google\.[a-z]+\/search
Whether you should use \/ or just / before search depends on the language you are using.
As SeRPRo pointed out, this does not work for google.co.uk; to make it work there as well, you can use:
google\.[a-z]+(?:\.[a-z]+)?\/search
(is there any country that requires a third level?)
This one works:
google\.[a-zA-Z\.]+/(search\W.+)
You might want the following:
google\.[a-zA-Z.]+/search
Both other answers should work fine until you meet a second-level Google domain like google.com.ua.
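The difference shows up when the single-label pattern from the earlier answer is compared with the pattern from this one. A minimal Python sketch, with variable names chosen for illustration:

```python
import re

# Earlier answer: exactly one label after "google."
single_tld = re.compile(r"google\.[a-z]+/search")
# This answer: letters and dots, so any number of labels
any_suffix = re.compile(r"google\.[a-zA-Z.]+/search")

urls = [
    "google.com/search?q=x",
    "google.co.uk/search?q=x",
    "google.com.ua/search?q=x",
    "googledummy.com/search?q=x",
]

if __name__ == "__main__":
    for url in urls:
        print(url, bool(single_tld.search(url)), bool(any_suffix.search(url)))
```

Neither pattern is anchored, so both also match inside a longer host such as notgoogle.com/search.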