Google Analytics Referral Exclude Regex Partial Domain Name - regex

I am attempting to filter out some of the nasty analytics referral spam. It never actually touches my site, so .htaccess is out.
I have to go into Google Analytics itself to create a filter. I have a few set up already, but I am looking to try something new that will hopefully make my exclusion list a bit easier to manage.
I want to block any referral traffic coming from a domain that has seo, traffic, monitize, etc. in it. This would stop about 90% of the spam referral traffic and would keep excluding new sites as they appear.
What I currently use is this:
(seomonitizer|trafficseo|seotraffic|trafficmonitizer)\.(com|org|net|рф|eu|co)
It blocks each site one by one, but when a new spam site shows up, I have to add it to the list.
I'm not sure what the regex capabilities and limitations of the Analytics filters are, but possibly something like this could be the foundation; I'm just not sure what goes in the middle:
((?=())\.(?=()))
Thanks

Unfortunately you will have to check each one and add it to your list as they appear in your account. To answer your question, here is an example of what I use:
.*((darodar|priceg|buttons\-for(\-your)?\-website|makemoneyonline|blackhatworth|hulfingtonpost|o\-o\-6\-o\-o|(social|(simple|free|floating)\-share)\-buttons)\.com|econom\.co|ilovevitaly(\.co(m)?)|(ilovevitaly(\.ru))|(humanorightswatch|guardlink)\.org).*
For example, I like to use .co(m)? instead of .com, so that both .co and .com are covered.
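If you would rather match on keywords than on exact domains (which is what you were asking about), something along these lines should be a workable starting point. Treat it as an untested sketch, and note that as far as I know the Analytics filter fields only accept fairly plain regex, so stick to simple alternation rather than lookaheads:
.*(seo|traffic|moniti[sz]e|moneti[sz]e).*\.(com|org|net|рф|eu|co)
That catches any referral source whose name contains one of the keywords and ends in one of the listed TLDs, but it can also catch legitimate referrers that happen to contain "seo" or "traffic", so keep an eye on what it filters.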
Remember, there are currently three methods to avoid ghost referrals:
1) The first one (the one you are using) is to create a filter that blacklists all the bad traffic, but there is a limit on the number of characters you can use, so you might end up creating multiple similar filters to cover all the nasty referral spam. Here is a link with a complete list of bad bots.
2) The second method is to check the box "Exclude all hits from known bots and spiders" under your Google Analytics Account > Property > View settings.
3) Create a hostname filter following the steps in this article (a rough sketch of such a filter is below).
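As a sketch of method 3, a "valid hostname" include filter is usually just an alternation of the hosts you actually serve; the domains here are placeholders you would replace with your own:
^(www\.)?example\.com$|^shop\.example\.com$
Anything that reports a hostname outside that list (which ghost referrals typically do) gets dropped.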

Related

Blocking based on full URL and not just the URI in AWS WAF

I am using AWS WAF across multiple CloudFront distributions which go to different URLs. Generally speaking, it is working well. However, we have noticed particular activity on a few of the underlying sites that I want to block, but I don't want to block it across all the sites.
It seemed simple enough to me to create a WAF rule that would match a regex on the URI and block based on that. However, it appears that AWS WAF does not use the host in its URI matching. For example this rule:
Inspect URI, Block based on RegEx with RegEx being:
^(http|https):\/\/(www)?\.?example\.net\/(.*)?\/*.html$
And these test URLs work in my regex tester:
http://example.net/blah.html
https://example.net/blah.html
http://www.example.net/blah.html
https://example.net/stuff/blah.html
When I apply it to the WAF, though, it does not block.
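My working theory is that WAF's URI inspection only ever sees the path portion of the request, never the scheme or host, in which case a pattern anchored on http(s):// can never match. A quick check outside of WAF seems consistent with that (Python here is purely for illustration):
import re

# The pattern from my WAF rule, anchored on the scheme and host.
pattern = re.compile(r'^(http|https):\/\/(www)?\.?example\.net\/(.*)?\/*.html$')

full_url = 'http://example.net/blah.html'  # what my regex tester matches against
path_only = '/blah.html'                   # what WAF's URI match presumably sees

print(bool(pattern.search(full_url)))   # True
print(bool(pattern.search(path_only)))  # False - the scheme/host anchor is never satisfied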
Is there something else I can do here to achieve what I am looking to do? I do not want to edit anything directly on my hosting servers because it would be more of a maintenance headache and it would not solve the problem I am attempting to solve (which is stop bots from spamming bad URLs and spiking my server with 404s).
I also realize someone may suggest I could do a rate limit - which I do have in place - but the bots are coming from many different IPs so that doesn't solve this particular case. Instead, I just want to block some of the URL types that they keep trying to get to. In this case, it's thousands and thousands of HTML pages. It also does not take into account that I only want to block these requests for a very specific site.

Retrieve/use G suite Default Routing rules programatically

I am only looking for read-only access.
I'd like to develop either a small web app, or maybe a script embedded in Google Sheets, that allows my users to look up which Google Admin default routing rules they are involved in.
To do that, I'll need an API to go through the rules and tabulate the information in the way I need it.
Can I do that with the Admin SDK, which is soon to be deprecated? Is there a replacement product that can do what I want?
More details:
I currently use default routing for a few purposes. I have about 15 rules, and each one changes the route of a simple Match Rule by adding extra recipients. Some of these are to catch emails sent to ex-employees.
Others are to handle certain general email addresses like sales@example.com. Rather than using a sales group, we have a sales user account. And rather than putting forwarding rules in that user's settings, we use default routing.
I had a similar problem where I needed the routing rules, though my case was a bit different since I just wanted one-time access to see what was going on, not something for users. I could not find anything else that even helped me retrieve the rules (other than opening each one up individually). I ended up finding that I could just scrape the HTML of the routing rules page to a CSV and filter for lines containing an '@' character. The rules also contain a bunch of t/f values that can presumably be mapped back to their function; I didn't need all that and didn't spend the time to figure it out. This probably doesn't help the original poster's case, but perhaps my finding can help the next person looking for a way to do this.
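For what it's worth, the filtering step is trivial to script once the scraped page is saved somewhere; a minimal sketch in Python (the file names are just placeholders for whatever you saved):
# Keep only the scraped lines that contain an email address, i.e. an '@'.
with open('routing_rules.csv') as src, open('routing_rules_filtered.csv', 'w') as dst:
    for line in src:
        if '@' in line:
            dst.write(line)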

How to validate Top Level Domain of an email address?

Let's say I have a contact form where a user can enter his email address along with his other contact details. I need to check the validity of the generic top-level domain (gTLD) or top-level domain of the email address. An example:
scarlet.1992@examplemail.paris
I need to check if .paris is a valid top level domain.
Please refer to this link for the list of domains available, which puts the number at around 1,200. Storing the domain names in a local table and searching is not an option since new domains are being introduced every day.
Please let me know if there is any web service or free API available for this, or there is any other way to validate the email address.
The simplest way to find out whether a domain exists is to check whether it has a name server.
Considering that a TLD costs around $100,000 it is very likely that every one that is purchased is in use. Also, if it doesn't have a name server, you can't send anything to it anyway.
Using dig you can run
dig NS +short paris
which will give
h.ext.nic.fr.
d.nic.fr.
g.ext.nic.fr.
f.ext.nic.fr.
whereas
dig NS +short adsfadfs
returns nothing.
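If you want the same check from code instead of the command line, here is a small sketch using the dnspython package (my choice of library, not a requirement):
import dns.resolver  # pip install dnspython

def tld_exists(tld: str) -> bool:
    """Return True if the TLD has at least one NS record."""
    try:
        dns.resolver.resolve(tld, 'NS')
        return True
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.resolver.NoNameservers):
        return False

print(tld_exists('paris'))     # True
print(tld_exists('adsfadfs'))  # False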
There is nothing wrong with storing a list of TLDs locally when you need a quick answer for client-side validation or don't want to spend a network round trip on a DNS lookup.
Email addresses from newer TLDs are extremely rare for most use cases. I update my list about once per year and find that it's good enough.
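If you do keep a local list, IANA publishes the current set of TLDs as a plain text file, so refreshing it once in a while is easy; a sketch:
import urllib.request

# IANA's published list of all current TLDs; the first line is a comment.
IANA_TLDS = 'https://data.iana.org/TLD/tlds-alpha-by-domain.txt'

with urllib.request.urlopen(IANA_TLDS) as resp:
    lines = resp.read().decode('ascii').splitlines()

tlds = {line.strip().lower() for line in lines if line and not line.startswith('#')}
print('paris' in tlds)     # True
print('adsfadfs' in tlds)  # False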

How to check spammyness of a link/url

I know that most spams are related with one or more links, so I am wondering if there is any web service which can check the spam-weight/spammyness of a URL. Similar to how Akismet can check the spammyness of text content.
p.s. - I searched in google and couldn't find anything satisfactory :)
There are a number of different URI DNS-based blackhole list (DNSBL) services available to the public for low-volume lookups. Two of the best-known are SURBL and URIBL. PhishTank (run by OpenDNS) is also worth a look, as many of the URLs are categorized and classified along with being listed.
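The lookup mechanics are the same as for any DNSBL: you query the domain under the list's zone, and getting back an A record (usually 127.0.0.x) means it is listed. A rough sketch against SURBL's public multi.surbl.org zone (check their usage limits before doing this at any volume):
import socket

def is_listed(domain: str, dnsbl: str = 'multi.surbl.org') -> bool:
    """Return True if the domain resolves under the DNSBL zone, i.e. is listed."""
    try:
        socket.gethostbyname(f'{domain}.{dnsbl}')
        return True
    except socket.gaierror:
        return False

print(is_listed('example.com'))  # should print False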

How does a tool like SEOMoz Rank Checker work?

It seems there are a number of tools that allow you to check a site's position in search results for long lists of keywords. I'd like to integrate a feature like that in an analytics project I'm working on, but I cannot think of a way to run queries at such high volumes (1000s per hour) without violating the Google TOS and potentially running afoul of their automatic query detection system (the one that institutes a CAPTCHA if search volume at your IP gets too high).
Is there an alternative method for running these automated searches, or is the only way forward to scrape search result pages?
Use a third party to scrape it if you're scared of Google's TOS.
Google is very quick to temporarily ban or block IP addresses that appear to be sending automated queries. And yes, of course, this is against their TOS.
It's also quite difficult to know exactly how they detect automated queries, but the main trigger is definitely repeated identical keyword searches from the same IP address.
The short answer is basically: get a lot of proxies.
Some more tips:
Don't search further than you need to (e.g. the first 10 pages)
Wait around 4-5 seconds between queries for the same keyword
Make sure you send real browser headers and not a default User-Agent like "curl/..." (sketch below)
Stop scraping with an IP when you hit the roadblocks and wait a few days before using the same proxy.
Try to make your program act like a real user would and you won't have too many issues.
You can scrape Google quite easily but to do it at a very high volume will be challenging.
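To put the pacing and header tips above into code, a rough sketch (the URL and User-Agent string are placeholders, not recommendations of any particular target):
import random
import time
import urllib.parse
import urllib.request

# Send a real browser User-Agent rather than the library default.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

def fetch(url: str) -> str:
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8', errors='replace')

for keyword in ['first keyword', 'second keyword']:
    html = fetch('https://example.com/?q=' + urllib.parse.quote(keyword))
    # ... parse the results page here ...
    time.sleep(random.uniform(4, 5))  # wait roughly 4-5 seconds between queries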