Google Analytics IP Filter Exclude - regex

Could someone help me with some REGEX...
I have been blocking internal traffic using the filter pattern:
10.*..
This has just come back to bite me, as it is blocking all referral traffic between our sites.
What I want to do now is block everything except 10.103..
Do I need to apply two separate ranges, or can I accomplish this with one filter?

If you want to block everything but 10.103.xxx.xxx, use an include filter instead of the usual exclude filter.
NOTE ABOUT REGEXES MATCHING IPs IN ANALYTICS
I am not sure if the filter I suggested above uses regex or not (literal string match), but it doesn't make a difference because there's no way the expression 10.103. could be misinterpreted in an IP address.
Your original pattern, on the other hand, is bogus and is probably hurting you. That's because in a regex the dot . is not a literal dot, but represents any character. Your expression, in fact, excludes every single IP that merely starts with 10 (not just 10. that is ten-dot), including 100.xxx, 101.xxx etc.
The correct version of your original excluding regex would be 10\..*, which contains an escaped dot (\.), then proceeds to any characters after that (.*).
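To see the difference, you can run the patterns against a few sample addresses. This is only a sketch in Python's re module, not in Analytics itself. It assumes the filter does an unanchored partial match (hence re.search); the sample IPs are made up, and the ^ anchor is an addition that is not part of the pattern given above.

```python
import re

# Hypothetical sample addresses, for illustration only.
ips = ["10.103.4.7", "10.20.30.40", "100.1.2.3", "192.168.10.55"]

unescaped = re.compile(r"10.*")       # dot means "any character"
escaped = re.compile(r"^10\..*")      # anchored (assumption), literal dot after 10
only_103 = re.compile(r"^10\.103\.")  # the 10.103.x.x range only

def matching(pattern, candidates):
    """Return the candidates that the pattern matches anywhere (partial match)."""
    return [ip for ip in candidates if pattern.search(ip)]

print(matching(unescaped, ips))  # every address contains "10" somewhere
print(matching(escaped, ips))    # only addresses that begin with "10."
print(matching(only_103, ips))   # only the 10.103 range
```

Without the anchor, 10\..* would still hit an address such as 192.168.10.55 under partial matching, which is why the sketch adds ^.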

Regular expressions are explained very well in the Google Analytics Help (here).
For multiple IPs, there is this little helper, which generates the regex for you.
To block internal traffic, choose ADD NEW FILTER, then CUSTOM, then EXCLUDE, and put the IP as a regex in the field. That's it.

Regex that can handle an arbitrary number of asterisks in a word

I'm trying to write a regex for x509 CN/SAN validation and have just learned that apparently partial wildcards are possible in theory. How would I build a regex to handle this when I want to make sure that it captures all certificates that might be issued for example.org?
My naive approach would be
\**e\**x\**a\**m\**p\**l\**e\**.\**o\**r\**g\**
not including possible subdomains of course. This looks pretty bad though and really inflates the term longer than I'd like it to be. Is there a more concise way to get the behaviour I described?
Edit: I also just realised that my naive regex wouldn't even catch when someone uses the asterisk to replace a part of the domain, e.g. exa*.org.
Since I feel like there's a possibility that this is not easily expressible in a concise regex, I solved my use case within the Python code that surrounds my previous regex check.
Instead of mapping a regex to the domains appearing in a certificate, I instead convert the certificate domain into a regex pattern, replace the literal dots with escaped dots and the asterisk with [a-zA-Z0-9-]{0,63}. I then compare it to the list of domains I manage and if the regex matches, I know that the certificate is applicable to the managed domain.
If someone manages to express this in a concise regex I'd still be interested.
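The approach described above could look roughly like the following. This is a minimal sketch, not the original code: the function names and the list of managed domains are hypothetical, and the case-insensitive flag is an assumption.

```python
import re

# Replacement for "*" as described above: up to 63 hostname characters.
WILDCARD = "[a-zA-Z0-9-]{0,63}"

def cert_name_to_regex(cert_name):
    """Turn a certificate CN/SAN entry such as 'exa*.org' into a compiled regex."""
    parts = cert_name.split("*")
    # re.escape takes care of the literal dots (and any other special character).
    body = WILDCARD.join(re.escape(part) for part in parts)
    # Assumption: compare case-insensitively, as DNS names usually are.
    return re.compile(body, re.IGNORECASE)

def covers(cert_name, managed_domains):
    """Return the managed domains that the certificate name would apply to."""
    pattern = cert_name_to_regex(cert_name)
    return [d for d in managed_domains if pattern.fullmatch(d)]

managed = ["example.org", "example.com", "www.example.org"]  # hypothetical list
print(covers("exa*.org", managed))
print(covers("*.example.org", managed))
```

Using fullmatch means the generated pattern has to cover the whole managed domain, so no explicit ^ and $ are needed.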

Google Analytics Content Grouping by Extraction - extract 3rd level subdirectories

I've been going round in circles with this one.
I'd like to set up a content grouping in Google Analytics that groups by a 3rd level subdirectory.
I can grab the second level successfully with the following regex
`/destinations/(.*?)/`
where the url is
mydomain.com/destinations/europe
mydomain.com/destinations/alaska
I get content groups of europe and alaska.
However, I also then want a grouping of the next level, for example
mydomain.com/destinations/europe/southampton
mydomain.com/destinations/europe/portugal
mydomain.com/destinations/alaska/somealaskanplace
to give me groupings of southampton, portugal and somealaskanplace
This means I effectively need to ignore whatever is in the second level, and this is what I'm struggling with.
So far I have
`/destinations\/.*\/(.*?)/$`
but that has given me the domain name as a grouping.
Can anyone help? It would be very much appreciated.
You need to turn the multiline flag on.
Try this:
/.*?\/(destinations)\/(\w+)\/(\w+)/gm
Demo on Regex101:
https://regex101.com/r/2wvRIx/2
I don't think you need the / delimiters. GA may be interpreting your last /$ as being a slash and then end-of-string. Try making it just /destinations/.*/(.*?)$ (note that GA regex does not require you to escape slashes).
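One way to sanity-check an extraction pattern before pasting it into GA is to run it over the sample URLs. The sketch below uses Python's re module as a stand-in for GA's regex engine, so treat it as a rough check only. The [^/]+ form is an alternative to .* and is an assumption, not something suggested in the answers above; it has the advantage of not depending on a trailing slash.

```python
import re

# Second-level group, as in the question, written without lazy matching.
second_level = re.compile(r"/destinations/([^/]+)")
# Third-level group: skip one path segment, then capture the next one.
third_level = re.compile(r"/destinations/[^/]+/([^/]+)")

def group_for(pattern, url):
    """Return the captured segment, or None when the URL has no such level."""
    match = pattern.search(url)
    return match.group(1) if match else None

urls = [
    "mydomain.com/destinations/europe",
    "mydomain.com/destinations/europe/southampton",
    "mydomain.com/destinations/alaska/somealaskanplace/",
]
for url in urls:
    print(url, group_for(second_level, url), group_for(third_level, url))
```

Because each segment is matched with [^/]+, the capture can never swallow a slash, so the domain name cannot end up in the group.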

regex to find domain without those instances being part of subdomain.domain

I'm new to regex. I need to find instances of example.com in an .SQL file in Notepad++ without those instances being part of subdomain.example.com.
From this answer, I've tried using ^((?!subdomain))\.example\.com$, but this does not work.
I tested this in Notepad++ and at https://regex101.com/r/kS1nQ4/1, but it fails in both.
Help appreciated.
Simple
^example\.com$
with g,m,i switches will work for you.
https://regex101.com/r/sJ5fE9/1
If the matching should be done somewhere in the middle of the string you can use negative look behind to check that there is no dot before:
(?<!\.)example\.com
https://regex101.com/r/sJ5fE9/2
Without access to example text, it's a bit hard to guess what you really need, but the regular expression
(^|\s)example\.com\>
will find example.com where it is preceded by nothing or by whitespace, and followed by a word boundary. (You could still get a false match on example.com.pk because the period is a word boundary. Provide better examples in your question if you want better answers.)
If you specifically want to use a lookaround: the negative lookahead you used (as the name implies) specifies what the regex should not match at this point. So (?!subdomain\.)example always matches trivially, because the text at that position is example, not subdomain., and the negative lookahead can never fail.
You might be better served by a lookbehind:
(?<!subdomain\.)example\.com
Demo: https://regex101.com/r/kS1nQ4/3
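The two lookbehind variants from the answers behave differently, which a small test makes visible. This is a sketch in Python's re module (Notepad++ uses a different engine, so results there should be confirmed separately); the sample string is made up.

```python
import re

# Hypothetical sample text standing in for a line of the .SQL file.
sample = "example.com subdomain.example.com other.example.com"

no_subdomain = re.compile(r"(?<!subdomain\.)example\.com")  # rejects only "subdomain." before it
no_dot_before = re.compile(r"(?<!\.)example\.com")          # rejects any "something." before it

def match_positions(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

print(match_positions(no_subdomain, sample))   # bare domain and other.example.com
print(match_positions(no_dot_before, sample))  # bare domain only
```

Which one is right depends on whether other.example.com should count as a hit or not.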
Here's a solution that also accepts an optional www., http://www. or https://www. prefix:
/^(www\.)?(http:\/\/www\.)?(https:\/\/www\.)?example\.com$/

Creating filters for Google Analytics to remove spam

I have successfully managed to filter out hits from certain spammy sites from Google Analytics. It's an ongoing battle, as new sites are popping up all the time and polluting my acquisition/referral results.
At present, the following match is used by the GA filter to stop all the sites below showing up in the data:
.*(best\-seo\-solution|semalt|buttons\-for\-website|social\-buttons|best\-seo\-offer|Get\-Free\-Traffic\-Now|buttons\-for\-your\-website|free\-share\-buttons)\.com.*
I've added most of these myself and it works. However, I now need to create a pattern that lets me enter URLs that don't follow the standard something.com pattern, e.g.:
site4.free-share-buttons.com
site5.free-share-buttons.com
So in these cases the end is always the same but the start can be variable.
buy-cheap-online.info
In this case it ends with .info
www.event-tracking.com
This one uses www. whereas others do not
http://webmaster-traffic.com
This one has the http:// as well.
On top of all that, a filter pattern can be at most 255 characters long (although I can have more than one filter pattern), so I need to split it up.
How can I create a regex filter pattern that would target all above URLs?
Google Analytics lets you write a simple expression without escaping every special character, so you can drop the backslashes \ and the leading and trailing .* from your pattern. You can even remove the .com and the parentheses, since these names are already very specific:
best-seo-solution|semalt|buttons-for-website|social-buttons|best-seo-offer|Get-Free-Traffic-Now|buttons-for-your-website|free-share-buttons|event-tracking|buy-cheap-online.info
If a spam domain has a common name, add its full name for that specific case, e.g. |commonname.net.
You can keep going until you reach 255 characters; after that, just add a second filter. This will work, but it has three downsides:
First, one or two new spammers appear every week.
Second, by the time you add one, it has already registered some hits.
Third (and this is new behavior), some spam is now arriving as direct visits along with the referrals, and this filter won't stop that.
To prevent this, I recommend using a valid hostname filter instead. This filter only allows hits that carry one of your hostnames, so all ghost spam is excluded, since it uses either a fake hostname or none at all.
Here you can find more information about referrer spam and the valid hostname filter:
https://stackoverflow.com/a/28354319/3197362
http://www.ohow.co/things-you-must-know-about-spam-in-google-analytics/
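If you do stay with the referral exclude list, splitting it into patterns that respect the 255-character limit can be automated. The following is a minimal sketch of that idea; the function name is hypothetical and the greedy packing strategy is just one reasonable choice.

```python
MAX_FILTER_LENGTH = 255  # limit quoted in the question

def split_into_filters(names, limit=MAX_FILTER_LENGTH):
    """Greedily pack names into '|'-joined patterns no longer than limit."""
    filters, current = [], ""
    for name in names:
        if len(name) > limit:
            raise ValueError(f"{name!r} alone exceeds the {limit}-character limit")
        candidate = f"{current}|{name}" if current else name
        if len(candidate) <= limit:
            current = candidate
        else:
            # The current pattern is full: store it and start a new one.
            filters.append(current)
            current = name
    if current:
        filters.append(current)
    return filters

spam_names = [
    "best-seo-solution", "semalt", "buttons-for-website", "social-buttons",
    "best-seo-offer", "Get-Free-Traffic-Now", "buttons-for-your-website",
    "free-share-buttons", "event-tracking",
]
# A small limit is used here only to make the splitting visible.
for pattern in split_into_filters(spam_names, limit=60):
    print(len(pattern), pattern)
```

Each resulting string would go into its own filter. The list above still fits into a single filter at the real 255-character limit.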

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (ex. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (Side question: is it necessary to have ^ or $ in the regex when matching against URLs?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
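The first-match behaviour can be modelled in a few lines. This is a simplified Python sketch of the idea, not CodeIgniter's actual router: the resolve function is hypothetical, and fullmatch stands in for the anchoring of the whole route.

```python
import re

# Ordered like the routes above: exact matches first, wild card last.
ROUTES = [
    (r"home", "pages"),
    (r"activity", "user/activity"),
    (r"([A-Za-z0-9][A-Za-z0-9_-]{2,254})", r"view/slug/\1"),
]

def resolve(uri, routes=ROUTES):
    """Return the target of the first route whose pattern matches the whole URI."""
    for pattern, target in routes:
        match = re.fullmatch(pattern, uri)
        if match:
            # Substitute the captured group into the target, like $1 in the route.
            return match.expand(target)
    return None  # nothing matched; a real framework would apply its fallback here

for uri in ["home", "activity", "homer", "user-name", "ab"]:
    print(uri, "->", resolve(uri))
```

Here home and activity never reach the wild card, while homer does, without any lookahead in the pattern.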
Now, that's all the complicated stuff. A little more FYI.
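Remember that dashes - have a special meaning (a range) when they sit between two characters inside [] brackets. A dash placed last in the class, as in yours, is already taken literally, but escaping it makes the intent explicit and keeps the class safe if more characters are appended later.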
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with CodeIgniter specifically, but routing in most frameworks is based on precedence. In other words, the default controller, 404, and similar routes should be defined first. Then you can simplify your regex to match only the slugs.
Ok answering my own question
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right), a start-of-line anchor (^), and an end-of-line anchor ($) after each reserved word and at the end of the expression.
Hope this helps someone who may run into this problem later.
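For anyone wondering what the $ anchors inside the lookahead change, here is a small check outside CodeIgniter. It is a Python sketch under assumptions: fullmatch mimics matching the whole URI, the names loose and strict are made up, and (?:...)$ is used as shorthand for repeating $ after each word.

```python
import re

SLUG = r"[A-Za-z0-9][A-Za-z0-9_-]{2,254}"
RESERVED = r"default_controller|404_override|home|activity"

# Without $: rejects any slug that merely starts with a reserved word.
loose = re.compile(rf"(?!{RESERVED}){SLUG}")
# With $: rejects only slugs that are exactly a reserved word.
strict = re.compile(rf"(?!(?:{RESERVED})$){SLUG}")

def allowed(pattern, slug):
    """True when the whole slug matches the pattern."""
    return pattern.fullmatch(slug) is not None

for slug in ["home", "homer", "activity-log", "user-name"]:
    print(slug, allowed(loose, slug), allowed(strict, slug))
```

So a user name like homer is refused by the version without the anchors and accepted by the version with them.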