Perl module to detect foreign URLs - regex

I'm making a crawler and I only want to use U.S. domains. For example, I would want:
http://thenorthface.com/
but I would not want:
http://uk.thenorthface.com
or
http://se.thenorthface.com/
Does anyone know of a way to do this, or a Perl module that does it? I know it could be done with regex, but I'm trying to avoid putting together a list of all foreign domain prefixes... Thanks a lot!

You cannot reliably determine what a "US" domain is from the URL. It's not even clear that the term "US domain" has any meaning.
For example, many US state abbreviations are also ISO 3166 country codes. What will you do with ar.xyz.com? Is that Arkansas or Argentina? What about ma.pdq.com... Massachusetts or Morocco (Maroc in French)?
You may be able to link second-level domains to a country (for a headquarters, at least), but hostnames and third-level domains will be impossible to classify.
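That said, if a rough heuristic is acceptable despite those caveats, you can classify by the public suffix alone: .us counts as US, other ccTLDs count as foreign, and generic TLDs like .com stay unknown. Here is a minimal Python sketch using the third-party tldextract package (a Perl analogue would be Domain::PublicSuffix on CPAN); the GENERIC_TLDS set and the labels are my assumptions, and note that it deliberately reports uk.thenorthface.com as "unknown" rather than "foreign", for exactly the reasons given above.

# pip install tldextract  (bundles a snapshot of the Public Suffix List)
import tldextract

# Assumption: generic TLDs are unclassifiable from the URL alone;
# any ccTLD other than .us is treated as "foreign". A heuristic only.
GENERIC_TLDS = {"com", "org", "net", "edu", "gov", "mil", "info", "biz"}

def classify(url):
    ext = tldextract.extract(url)      # splits on the public suffix
    tld = ext.suffix.split(".")[-1]    # last label, e.g. "uk" from "co.uk"
    if tld == "us":
        return "us"
    if tld in GENERIC_TLDS:
        return "unknown"               # could be hosted anywhere
    return "foreign"                   # e.g. uk, se, de

for url in ["http://thenorthface.com/", "http://uk.thenorthface.com",
            "http://example.co.uk/"]:
    print(url, "->", classify(url))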

Related

How to deal with different ways of writing the same thing

I want to know if Django has any module to deal with this problem.
I have multiple ways of writing the same city name in a PostgreSQL database, which came from scraping different websites. The "city name" field could be "S. Diego" or "San Diego". My question is whether there is a module that could always normalize to "San Diego" in both situations, and that would let me add a new normalization when a new variant like "S Diego" appears, maintaining this workflow.
Thanks
You can use an API to normalize the data you have scraped. Yandex and Google both have features that return a list of likely location names for a search query. Take the most likely answer returned and use it to map your input to the canonical name. There are manual mapping approaches as well, but I highly recommend one of the giants that solved this problem before us.
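If you go the API route, the workflow might look like the following Python sketch (framework-agnostic, not a Django module): check a hand-maintained mapping first, then fall back to the Google Geocoding API and cache whatever it learns. The MANUAL_MAP seed entries and the function name are illustrative assumptions.

import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# Hand-maintained variants; extend this whenever a new spelling appears.
MANUAL_MAP = {"S. Diego": "San Diego", "S Diego": "San Diego"}

def normalize_city(raw_name, api_key):
    """Return a canonical city name for raw_name, or None if unresolved."""
    if raw_name in MANUAL_MAP:
        return MANUAL_MAP[raw_name]
    # Fall back to the geocoder and take its top-ranked result.
    resp = requests.get(GEOCODE_URL,
                        params={"address": raw_name, "key": api_key})
    results = resp.json().get("results", [])
    if not results:
        return None
    for component in results[0]["address_components"]:
        if "locality" in component["types"]:
            MANUAL_MAP[raw_name] = component["long_name"]  # cache the variant
            return component["long_name"]
    return None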

REGEXP_MATCH multiple words in a string using a CASE statement in Google Data Studio

I am using Google Data Studio to make a CASE statement that takes a multi-word string and splits it out into categories. I was asked to use REGEXP_MATCH (nothing else; I know a contains-type function would be easier).
I need a solution to match the following words:
HouseBrochure
home brochure
HomeBrochure
house brochure
Bathroom brochure
Bathroombrochure
FloorBrochure
floor brochure
To complicate matters, these words come in via a website request system, meaning people can request a house, bathroom and floor brochure in one request. When such requests reach my server, they are compiled into a single list (string) which looks like this:
# (with the pipes included)
HouseBrochure|Bathroom brochure|floor brochure
This is just an example of one request; there are many variations, and multiple requests come through. (I've also only included a few of these brochures; there are many more.)
I need to separate out all the house brochures, all the bathroom brochures and all the floor brochures etc, so I can count how many requests have been made for each brochure.
I'm new to regex; I have a basic understanding, but nothing advanced.
My current attempt in Data studio looks like this:
CASE
WHEN REGEXP_MATCH(Event Label,'^.*(HouseBrochure.*|home brochure.*|HomeBrochure.*|house brochure.*).*$') THEN 'Home Brochure'
END
This is just for the home brochure, yet it's not working. Can someone help?
Also, as an FYI, Data Studio uses RE2.
My approach would be:
Convert everything to lower case (to avoid dealing with upper/lower-case differences).
Use a regex to replace each variation with its base form: e.g. replace
(house|home)\s*brochure
with
HomeBrochure
Do some counting as needed, using just the base keywords.
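To make the idea concrete, here is a small Python sketch of the normalize-and-count flow (Python's regex syntax is close enough to RE2 for these patterns; the category names and the variants they cover are assumptions based on the list above). In Data Studio itself, the equivalent home-brochure match might look like REGEXP_MATCH(Event Label, '.*(?i)(house|home)\s*brochure.*').

import re
from collections import Counter

# Each category maps to an RE2-compatible pattern covering its variants.
CATEGORIES = {
    "Home Brochure": re.compile(r"(house|home)\s*brochure", re.IGNORECASE),
    "Bathroom Brochure": re.compile(r"bathroom\s*brochure", re.IGNORECASE),
    "Floor Brochure": re.compile(r"floor\s*brochure", re.IGNORECASE),
}

def count_brochures(requests):
    """Count brochure categories across pipe-delimited request strings."""
    counts = Counter()
    for request in requests:
        for item in request.split("|"):
            for label, pattern in CATEGORIES.items():
                if pattern.search(item):
                    counts[label] += 1
    return counts

print(count_brochures(["HouseBrochure|Bathroom brochure|floor brochure"]))
# Counter({'Home Brochure': 1, 'Bathroom Brochure': 1, 'Floor Brochure': 1})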

Correct a non-existent domain name to the nearest match

I'm looking for a service that tells you the nearest match for a non-existent domain that was misspelled by the user. For example, if a user writes 'hotmail.con', send a query with that and obtain 'hotmail.com' as the result.
You've picked a hard problem. A domain label can be 1-63 characters long, may contain only the characters [a-z0-9-], and may not start with a hyphen. Brute-forcing it is not an option. If the user types in hotmail.con, you could search misspellings of it, which would try homail.com and hotmale.com; these may or may not be real domain names, and who is to know WHICH misspelling is the correct one? The computer would have to return a list of options to the user: "Did you mean this domain name, or maybe that domain name?"
You might be interested in Peter Norvig's spelling corrector, written in the spirit of the one Google uses to spell-check incoming queries. It's one of the best-known spelling correctors on the planet.
http://norvig.com/spell-correct.html
Peter Norvig's spell checker should work, provided you have an up-to-date corpus of correct domain names. You could build your own list on the fly by keeping track of which sites the user has visited and using those as the corpus of domain names to check against. That way, when the user types "hotmail.con", it finds hotmail.com in your list. However, this does not protect the user from accidentally visiting "hotmale.com", because that is a valid site.
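As a sketch of that idea in Python, Norvig-style: generate every string one edit away from the typo and intersect it with the corpus of visited domains (the KNOWN_DOMAINS set here is a stand-in for that history).

# Norvig-style candidate generation, restricted to domain characters.
LETTERS = "abcdefghijklmnopqrstuvwxyz0123456789-."

def edits1(word):
    """All strings exactly one edit away from word."""
    splits = [(a := word[:i], word[i:]) for i in range(len(word) + 1)]
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

# Stand-in corpus: domains the user has actually visited.
KNOWN_DOMAINS = {"hotmail.com", "gmail.com", "stackoverflow.com"}

def correct_domain(domain):
    """Return a known domain at most one edit away, or None."""
    if domain in KNOWN_DOMAINS:
        return domain
    candidates = edits1(domain) & KNOWN_DOMAINS
    return min(candidates, default=None)   # pick deterministically

print(correct_domain("hotmail.con"))       # -> hotmail.com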
Here is a Stack Overflow question about how to get all the domain names:
https://stackoverflow.com/questions/4539155/how-to-get-all-the-domain-names
The best idea is to think outside the box and do it like Firefox does. When users start typing hotmail.com, they usually click a textbox, type "h", then "o". Have a dropdown appear with recently visited domain names that start with those letters.

Regular Expressions - Parsing Domain Issues

I am trying to find the domain -- everything but the subdomain.
I have this regexp right now:
(?:[-a-zA-Z0-9]+\.)*([-a-zA-Z0-9]+(?:\.[a-zA-Z]{2,3})){1,2}
This works for things like:
domain.tld
subdomain.tld
But it runs into trouble with TLDs like ".com.au" or ".co.uk":
domain.co.uk (finds co.uk, should find domain.co.uk)
subdomain.domain.co.uk (finds co.uk, should find domain.co.uk)
Any ideas?
I'm not sure this problem is "reasonably solvable": Mozilla maintains a list of "public suffix" domains intended to help browser authors accept cookies only for domains within one administrative control (e.g., to prevent someone from setting a cookie valid for *.co.uk or *.union.aero). It obviously isn't perfect: near the end you'll find a long list of is-a-caterer.com-style domains, so foo.is-a-caterer.com couldn't set a cookie that would be used by bar.is-a-caterer.com, yet is-a-caterer.com is itself perfectly well a "domain" as you've defined it.
So, if you're prepared to use the list as provided, you could write a quick little parser that would know how to apply the general rules and exceptions to determine where in the given input string your "domain" comes, and return just the portion you're interested in.
I think simpler approaches are doomed to failure: some ccTLDs, such as .ca, don't use second-level categories; some, such as .br, use dozens; and some public suffixes, like lib.or.us, sit several levels above the actual "domain", such as multnomah.lib.or.us. Unless you use a curated list of which domains are public suffixes, you're doomed to be wrong for some non-trivial set of input strings.
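For what it's worth, libraries in most languages already wrap that Mozilla list, so you may not need to write the parser yourself. A quick Python sketch using the third-party tldextract package (the hostnames are the examples from the question; output shown as of a recent copy of the list):

# pip install tldextract  (ships with Mozilla's Public Suffix List)
import tldextract

for host in ["subdomain.domain.co.uk", "domain.co.uk", "multnomah.lib.or.us"]:
    ext = tldextract.extract(host)
    # ext.suffix is the public suffix, ext.domain the label just below it.
    print(host, "->", f"{ext.domain}.{ext.suffix}")
# subdomain.domain.co.uk -> domain.co.uk
# domain.co.uk -> domain.co.uk
# multnomah.lib.or.us -> multnomah.lib.or.us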

Google Geocoding UK address does not get listed

I followed the examples on the following site:
http://code.google.com/apis/maps/documentation/geocoding/
So I expected the following URL to give me UK addresses, but it is still giving me a US address. Any ideas?
http://maps.googleapis.com/maps/api/geocode/json?address=baker&sensor=false&region=gb
The "region" parameter will only make the region a preference not lock out all other results.
In this case it seems the address "Baker" doesn't even show up on Google Maps as a known location - only as businesses.
http://maps.google.co.uk/maps?q=baker&hl=en&sll=53.800651,-4.064941&sspn=18.336241,46.362305&t=h&z=5
I also Googled "Baker village", "Baker town", etc., but with no luck. I'm guessing that location is particularly obscure, so Google is returning what it considers the more likely results - in the US.
If you try another example like "Birmingham", which exists in both the US and the UK, you'll notice it favours the UK due to the region setting:
http://maps.googleapis.com/maps/api/geocode/json?address=birmingham&sensor=false&region=uk
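To see the difference programmatically, here is a small Python sketch against the same endpoint. Note that today's API also requires a key parameter, omitted here, and the components=country:GB filter (not mentioned in the answer above, but documented for this API) hard-restricts results to a country rather than merely biasing them.

import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(address, **extra):
    # A real request now also needs key=YOUR_API_KEY in the params.
    resp = requests.get(GEOCODE_URL, params={"address": address, **extra})
    return [r["formatted_address"] for r in resp.json().get("results", [])]

# region=uk only *biases* results toward the UK...
print(geocode("birmingham", region="uk"))
# ...while components=country:GB *restricts* results to the UK.
print(geocode("baker", components="country:GB"))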