Regex for Google filter - regex

My Google Analytics is a victim of ghost spam, in particular referrals that look like this:
/?from=http://site-63162314-1.snip.tw/
/?from=http://site31946091.snip.tw/
/?from=http://site18623769.snip.tw/
I'd like to create a regex which will match the following string: site(something).snip.tw
After reading up on regex on a few sites, my attempt is:
.*site.*snip
This only filters out the last in the list above (/?from=http://site18623769.snip.tw/), but oddly it also filters out other URLs, such as ?from=http://share-buttons.xyz/ (which is fine) and, sadly, /MassageTherapy/DeepTissueMassage?cpc=dtm
Why does my regex not match the above 3 examples?
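As a quick sanity check, here is a minimal Python sketch that runs the pattern above against the URLs quoted in the question. It assumes Python's `re` module with search semantics, which may not be how the Analytics filter applies the pattern. The `tighter` pattern and the `matched` helper are illustrative and not part of the original post.

```python
import re

# URLs quoted in the question.
urls = [
    "/?from=http://site-63162314-1.snip.tw/",
    "/?from=http://site31946091.snip.tw/",
    "/?from=http://site18623769.snip.tw/",
    "/?from=http://share-buttons.xyz/",
    "/MassageTherapy/DeepTissueMassage?cpc=dtm",
]

# The pattern from the question.
original = re.compile(r".*site.*snip")

# Illustrative alternative: "site", then digits or hyphens, then the literal domain.
tighter = re.compile(r"site[-0-9]+\.snip\.tw")


def matched(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [u for u in candidates if pattern.search(u)]


print(matched(original, urls))
print(matched(tighter, urls))
```

Under these assumptions both patterns pick out the three snip.tw referrals and neither touches the other two URLs, so the behaviour described above may come from how the filter is configured rather than from the pattern alone.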

Related

Trying to regex YouTube ads with pihole

EDIT:
As far as I know, Pihole does not block YouTube ads.
Original Post:
Trying to regex urls like:
r4---sn-vgqsrnez.googlevideo.com
r1---sn-vgqsknlz.googlevideo.com
r5---sn-vgqskn7e.googlevideo.com
r3---sn-vgqsknez.googlevideo.com
r6---sn-vgqs7ney.googlevideo.com
r4---sn-vgqskne6.googlevideo.com
r4---sn-vgqsrnez.googlevideo.com
r5---sn-vgqskn76.googlevideo.com
r6---sn-vgqs7ns7.googlevideo.com
r1---sn-vgqsener.googlevideo.com
r1---sn-vgqskn7z.googlevideo.com
r1---sn-vgqsknek.googlevideo.com
r6---sn-vgqsener.googlevideo.com
r3---sn-vgqs7nly.googlevideo.com
r1---sn-vgqsknes.googlevideo.com
r4---sn-vgqsrnes.googlevideo.com
r6---sn-vgqskn76.googlevideo.com
I've tried:
(^|\.)r[0-100]---sn-vgqs?n??\.googlevideo\.com$
(^|\.)r[0-100]?*\.googlevideo\.com$
^r[0-100]---sn-vgqs(?:.*)n(?:.*)(?:.*).googlevideo.com$
^r[0-100]---sn-vgqs(?:.*)n(?:.*).googlevideo.com$
but nothing works
I am probably using regex wrong, since I don't have much experience with it, but some people online have said it could be a Pi-hole issue.
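One detail worth checking in the attempts above: `[0-100]` is a character class, not a numeric range. It holds the range `0-1` plus two more literal `0`s, so it matches exactly one character, either 0 or 1. A small Python sketch follows. Python's `re` is used only for illustration (Pi-hole uses a different engine), and the `server_prefix` name is made up for this example.

```python
import re

char_class = re.compile(r"[0-100]")

# Which single digits does the class accept?
accepted = [d for d in "0123456789" if char_class.fullmatch(d)]
print(accepted)

# "One or more digits" is usually what is meant for the number after "r".
server_prefix = re.compile(r"r[0-9]+---")
print(bool(server_prefix.match("r4---sn-vgqsrnez.googlevideo.com")))
```

With the original class, a hostname starting with r4 or r5 can never match, whatever the rest of the pattern does.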
I'm guessing that you'd like to have restricted boundaries. If not, this expression might be somewhat close to what you have in mind:
^r\d+---sn-vgqs[a-z0-9]{4}\.googlevideo\.com$
Demo 1
You can add more boundaries, if necessary, such as:
^r(?:100|[1-9]\d|\d)---sn-vgqs[a-z0-9]{4}\.googlevideo\.com$
Demo 2
or:
^r(?:100|[1-9]\d|\d)---sn-vgqs(?:rne(?:s|z)|kne(?:s|z)|knlz|kn7e|7ney|kne6|kn76|7ns7|ener|kn7z|knek|7nly)\.googlevideo\.com$
Demo 3
which I'm just guessing.
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
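The three expressions above can also be checked offline. The sketch below runs them in Python against the hostnames from the question. It only shows that the patterns cover the samples in an engine that understands `\d` and `(?:...)`, and says nothing about whether Pi-hole accepts that syntax. The labels and the `unmatched` helper are illustrative.

```python
import re

# Distinct hostnames from the question.
hosts = [
    "r4---sn-vgqsrnez.googlevideo.com", "r1---sn-vgqsknlz.googlevideo.com",
    "r5---sn-vgqskn7e.googlevideo.com", "r3---sn-vgqsknez.googlevideo.com",
    "r6---sn-vgqs7ney.googlevideo.com", "r4---sn-vgqskne6.googlevideo.com",
    "r5---sn-vgqskn76.googlevideo.com", "r6---sn-vgqs7ns7.googlevideo.com",
    "r1---sn-vgqsener.googlevideo.com", "r1---sn-vgqskn7z.googlevideo.com",
    "r1---sn-vgqsknek.googlevideo.com", "r3---sn-vgqs7nly.googlevideo.com",
    "r1---sn-vgqsknes.googlevideo.com", "r4---sn-vgqsrnes.googlevideo.com",
]

patterns = {
    "any digits": r"^r\d+---sn-vgqs[a-z0-9]{4}\.googlevideo\.com$",
    "0 to 100": r"^r(?:100|[1-9]\d|\d)---sn-vgqs[a-z0-9]{4}\.googlevideo\.com$",
    "explicit suffixes": (
        r"^r(?:100|[1-9]\d|\d)---sn-vgqs"
        r"(?:rne(?:s|z)|kne(?:s|z)|knlz|kn7e|7ney|kne6|kn76|7ns7|ener|kn7z|knek|7nly)"
        r"\.googlevideo\.com$"
    ),
}


def unmatched(pattern, names):
    """Return the names the pattern fails to match."""
    return [n for n in names if re.search(pattern, n) is None]


for label, pattern in patterns.items():
    print(label, unmatched(pattern, hosts))
```

An empty list for a label means that expression covers every sample hostname.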
The following regex matches every hostname that starts with "r", followed by one or more arbitrary characters, then "sn", then one or more arbitrary characters, and that ends with ".googlevideo.com". The expression is anchored with ^ and $.
I tried it on my Pi-hole with great success but had to remove it later. Every r....sn...googlevideo.com query showed as blocked in the query list, but it also broke the YouTube app on my smart TV: it would not play any video at all until I removed the regex from Pi-hole. Use it at your own risk.
^r.+sn.+(\.googlevideo\.com)$
This post is a bit old, but since I have experimented with regexes myself, I just want to point out that your regexes can't work because of one "little" detail.
Pi-Hole uses the POSIX ERE (POSIX Extended Regular Expressions) standard.
So there are no lazy quantifiers or shorthand character classes.
It also does not support non-capturing groups like in your third and fourth line.
You can check such regexes in tools like RegexBuddy. Maybe other free tools can check it too and help to convert it.
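As a rough illustration of the points above, the following Python sketch flags the constructs mentioned (lazy quantifiers, shorthand classes, `(?...)` groups) in a pattern string. It is a heuristic written for this example, not a POSIX ERE validator and not part of Pi-hole. The names `NON_ERE` and `ere_problems` are hypothetical, and the checks ignore escaping, so they can produce false positives.

```python
import re

# Heuristic checks for syntax that POSIX ERE does not define.
NON_ERE = {
    "(?...) group": re.compile(r"\(\?"),
    "shorthand class": re.compile(r"\\[dDwWsS]"),
    "lazy quantifier": re.compile(r"[*+?}]\?"),
}


def ere_problems(pattern):
    """Return the names of the suspicious constructs found in the pattern text."""
    return [name for name, check in NON_ERE.items() if check.search(pattern)]


print(ere_problems(r"^r[0-100]---sn-vgqs(?:.*)n(?:.*).googlevideo.com$"))
print(ere_problems(r"(^|\.)r[0-100]---sn-vgqs?n??\.googlevideo\.com$"))
print(ere_problems(r"^r[[:digit:]]+---sn-4g5e[a-z0-9]{4}\.googlevideo\.com$"))
```

A pattern that comes back with an empty list has passed only these three checks, so it can still be invalid for other reasons.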
My current regex is:
^r[[:digit:]]+---sn-4g5e[a-z0-9]{4}\.googlevideo\.com$
It correctly blocks all ads BUT also videos.
If you use it, you have to do the following.
Open a YouTube video and check whether the video loads.
If not, go to the query log in your Pi-hole dashboard.
For your device you will see two DNS queries:
r5---sn-4g5e6nze.googlevideo.com
and
r5---sn-4g5ednse.googlevideo.com
The most recent one (the upper entry in the query log) is the video, so whitelist that domain. You will have to repeat this from time to time.
Greetings

Google analytics regex goal not working correctly

I have a regex to track signups to my site. There can be multiple addresses for a goal.
Here is my regex:
(\/membership\/signed-up\/|\/membership\/campagin\/(?!.*(not-this-campaign)).[-\w]+\/signed-up\/)
I want to match these addresses:
/membership/signed-up/
/membership/campagin/random-campaign/signed-up/
/membership/campagin/other-random-campaign/signed-up/
But I want to exclude this address:
/membership/campagin/not-this-campaign/signed-up/
It works, but Google also matches this address:
/membership/signed-up/step-2/
When I test it at http://regexr.com it matches only the strings I want, so why is Google Analytics matching more?
Try this :
(\/MEMBERSHIP\/SIGNED\-UP\/(?!.*(STEP\-2))|\/MEMBERSHIP\/CAMPAGIN\/(?!.*(NOT\-THIS\-CAMPAIGN)).[-\w]+\/SIGNED\-UP\/)
Your regex is almost correct, but you need to ensure it doesn't match step-2.
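Another way to look at the extra match: if the goal regex is applied as a partial (unanchored) match, which is an assumption here but consistent with the behaviour described, then /membership/signed-up/ is found inside /membership/signed-up/step-2/. The Python sketch below reproduces that with `re.search` and shows that anchoring the whole group with ^ and $ excludes it. Python accepts the lookahead in the original pattern, but the analytics engine may not, so this only illustrates the anchoring idea.

```python
import re

# The pattern from the question, unchanged.
original = (
    r"(\/membership\/signed-up\/"
    r"|\/membership\/campagin\/(?!.*(not-this-campaign)).[-\w]+\/signed-up\/)"
)
# Same pattern, required to cover the whole path.
anchored = "^" + original + "$"

paths = [
    "/membership/signed-up/",
    "/membership/campagin/random-campaign/signed-up/",
    "/membership/campagin/other-random-campaign/signed-up/",
    "/membership/campagin/not-this-campaign/signed-up/",
    "/membership/signed-up/step-2/",
]


def hits(pattern):
    """Return the paths in which the pattern is found."""
    return [p for p in paths if re.search(pattern, p)]


print(hits(original))  # includes the step-2 path
print(hits(anchored))  # only the three wanted paths
```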

KimonoLabs crawler Generated URL List with regex

So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to generate a URL list. The Kimono app has that option, and it actually splits the URL into sections, but for each section I'm only offered default value, range, and custom list.
I tried putting in things like "/.+/" to match all the characters, but that does not work, and I couldn't find any help on this in the official knowledge base.
I know that import.io had that "{alpahnumeric}", for example, for different parts of the URL so it matches them. Is there a way to accomplish that in the KimonoLabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not /s, followed by a /
([0-9]+)-([^/]+) Numbers followed by another bunch of not /s
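To see what the capture groups hold, here is a small Python sketch that applies the regex above to one URL in the shape described in the question. The product id 12345 is made up for the example, and this says nothing about how KimonoLabs itself handles patterns.

```python
import re

# The regex from the answer, unchanged.
product_url = re.compile(r"https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)")

# Hypothetical URL following the structure in the question.
url = "https://example.com/category/sub-category/12345-name-of-the-product/"

match = product_url.match(url)
category, sub_category, numeric_id, name = match.groups()
print(category, sub_category, numeric_id, name)
```

Each `([^/]+)` stops at the next slash, so the four groups line up with category, sub-category, numeric id, and product name.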

Regular expression not working in google analytics

I'm trying to build a regular expression to capture URLs which contain a certain parameter 7136D38A-AA70-434E-A705-0F5C6D072A3B
I've set up a simple regex to capture a URL with anything before and anything after this parameter (that is, all URLs which contain this parameter). I've tested this on an online checker, http://scriptular.com/, and it seems to work fine. However, Google Analytics says it is invalid when I try to use it. Any idea what is causing this?
Url will be in the format
/home/index?x=23908123890123&y=kjdfhjhsfd&z=7136D38A-AA70-434E-A705-0F5C6D072A3B&p=kljdaslkjasd
So I just want to capture URLs that contain that specific "z" parameter.
regex
^.+(?=7136D38A-AA70-434E-A705-0F5C6D072A3B).+$
You just need
^.+=7136D38A-AA70-434E-A705-0F5C6D072A3B.+$
Or (a bit safer):
^.+=7136D38A-AA70-434E-A705-0F5C6D072A3B($|&.+$)
And I think you can even use
=7136D38A-AA70-434E-A705-0F5C6D072A3B($|&)
See demo
Your regex is invalid because GA regex flavor does not support look-arounds (and you have a (?=...) positive look-ahead in yours).
Here is a good GA regex cheatsheet.
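The shortest pattern above uses no look-arounds, so it can be tried in any engine. A minimal Python sketch follows. The second and third URLs are invented for the test and show the parameter at the end of the query string and a longer value that merely starts with the same characters.

```python
import re

# The last pattern from the answer: the value must end the string or be followed by "&".
pattern = re.compile(r"=7136D38A-AA70-434E-A705-0F5C6D072A3B($|&)")

# URL from the question.
middle = "/home/index?x=23908123890123&y=kjdfhjhsfd&z=7136D38A-AA70-434E-A705-0F5C6D072A3B&p=kljdaslkjasd"
# Hypothetical URLs for comparison.
last = "/home/index?x=1&z=7136D38A-AA70-434E-A705-0F5C6D072A3B"
longer = "/home/index?z=7136D38A-AA70-434E-A705-0F5C6D072A3BFF"

for url in (middle, last, longer):
    print(bool(pattern.search(url)), url)
```

The `($|&)` tail is what rejects the third URL, while a plain substring pattern would accept it.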
To match /home/index?x=23908123890123&y=kjdfhjhsfd&z=7136D38A-AA70-434E-A705-0F5C6D072A3B&p=kljdaslkjasd you can use:
\S*7136D38A-AA70-434E-A705-0F5C6D072A3B\S*

A URL that contains all valid characters to test my regex pattern?

First of all I created my own regex to find all URLs in a text, because:
When I searched SO and Google, I only found regexes for specific URL constructions, like images, etc.
I found a pretty complete regex from the PHP's manual itself (see "splattermania at freenet dot de 01-Oct-2009 12:01" post at http://php.net/manual/en/function.preg-match.php) that can find almost anything that resembles a URL, as little as "bit.ly".
This pattern has a few errors and constraints, so I'm fixing and enhancing it.
Now the pattern structure seems right, but I'm not sure all valid characters are present. Please post samples of URLs to test my pattern. Might be laziness, but I don't want to read pages and pages of references to find all of them, need to focus on the development. If you have a summary of valid chars for username, password, path, query and anchor that you can share, that would be very very helpful.
Best Regards!
The pattern you linked to does indeed match a lot of URLs, both valid and invalid. It's not really a surprise since nearly everything in that regex is optional; as you wrote yourself, it even matches bit.ly, so it's easy to see how it would match lots of non-URL stuff.
It doesn't take new Unicode domain names into account, for one (e.g., http://www.müller.de).
It doesn't match valid URLs like
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
It doesn't match relative paths (might not be necessary, though) like /cgi-bin/version.pl.
It doesn't match mailto: links.
It doesn't match URLs like http://1.2.3.4. Don't even ask about IPv6 :)
All in all, regular expressions are NOT the right tool to reliably match or validate URLs. This is a job for a parser. If you can live with many false positive and false negative matches, then regexes are fine.
Please read Jan Goyvaerts' excellent essay on this subject: Detecting URLs in a block of text.
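To illustrate the parser route, here is a short Python sketch using the standard library's `urllib.parse.urlsplit` on the kinds of URLs listed above. The question is about PHP, so Python is used purely as an example, and the mailto address is made up. Note that a parser splits a string that is already believed to be a URL. It does not locate URLs inside free text.

```python
from urllib.parse import urlsplit

samples = [
    "http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx",
    "http://1.2.3.4",
    "/cgi-bin/version.pl",
    "mailto:someone@example.com",  # hypothetical address
]

# Split each sample into components instead of pattern-matching it.
parsed = {s: urlsplit(s) for s in samples}

for s, parts in parsed.items():
    print(repr(parts.scheme), repr(parts.hostname), repr(parts.path))
```

The parentheses in the path, the bare IP address, the relative path, and the mailto scheme all come through as separate components without any custom pattern.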