How to check availability of short domains containing a word? - list

I need to check the availability of all short domains that contain the word "hello". It can be anything like "hellohi", "aahellokk" or "hellowhello". I know there are services, like http://www.bluehost.com/cgi-bin/signup, where you can type in the domains one by one, but I want to bulk-check them. First, though, I need to generate the list of candidate names. I mistakenly tested this in Zsh:
echo {1..10}hello{A..Z}{5} > test
I don't know the easiest way to generate the list of names. How would you check their availability?

Here is my Python solution. To generate the domains use something like this:
from itertools import product, permutations
import operator
chars = 'abcdefghijklmnopqrstuvwxyz0123456789'
l = 2  # Max prefix / suffix length
# All strings of 1..l distinct characters from chars (note that permutations
# never repeats a character, so affixes like 'aa' are skipped)
words = reduce(operator.add, [[''.join(p) for p in permutations(chars, i)] for i in range(1, l+1)])
# Every prefix/suffix pair wrapped around 'hello'
domains = [w[0] + 'hello' + w[1] for w in product(words, words)]
This will take ages and use loads of memory if l is larger than 2 or 3. Also, you'll need Python 2.6 for some of the itertools functionality.
To check if the domains are available use this:
import commands  # Python 2 only; use subprocess in Python 3
for domain in domains:
    output = commands.getoutput('whois %s.com' % domain).lower()
    if 'not found' in output or 'no match' in output:
        print domain + '.com'
To speed this up you could use threads for the whois check.
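For example, here is a minimal sketch of the threaded variant (Python 3, using subprocess in place of the old commands module and a thread pool from concurrent.futures; it assumes a whois binary is on the PATH):
import subprocess
from concurrent.futures import ThreadPoolExecutor

def is_available(domain):
    # Run whois and look for the usual "no match" markers in its output
    output = subprocess.run(['whois', domain + '.com'],
                            capture_output=True, text=True).stdout.lower()
    return 'not found' in output or 'no match' in output

# 20 workers is an arbitrary choice; whois servers may also rate-limit you
with ThreadPoolExecutor(max_workers=20) as pool:
    for domain, available in zip(domains, pool.map(is_available, domains)):
        if available:
            print(domain + '.com')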

If you really want a zsh solution, use e.g. host, dig or nslookup to perform a DNS query, and assume that a failure means that a domain is still available. Keep an eye out for performance: some of these utilities may be faster than others.
If I may ask: what do you need this for? Are you a domain name squatter?

You can use a domain search API to check domain name availability.

For anything but the shortest names and longest words, the number of possible domains is enormous; far too many to enumerate in practice. For example, for a potential 11-letter domain name that you want to check a 4-letter word in, you're looking at at least 2 BILLION combinations (rough estimate). Of course, if you wanted to check that 11-letter domain name for a 10-letter word, you're looking at just 72 possibilities.
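A quick back-of-the-envelope count (my own rough arithmetic, assuming the 36-character a-z/0-9 alphabet and ignoring overlaps) shows how quickly this blows up:
alphabet = 36  # a-z plus 0-9

def candidates(domain_len, word_len):
    # Positions the word can sit at, times the ways to fill the remaining characters
    free = domain_len - word_len
    return (free + 1) * alphabet ** free

print(candidates(11, 10))  # 72
print(candidates(11, 4))   # 626913312768, i.e. hundreds of billions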

Related

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers (max 6 digits)] [words (one to eight words, with spaces in between; numbers and # allowed)], [words (one to four words, letters only)], [2 capital letters] [5-digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the correct format. The data comes from people answering a question on a form, so it will always be just an address, no names. Also, the addresses are all in one area of South Florida, where pretty much every address matches this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this:
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address: more than you can meaningfully handle with a regular expression. Besides, if a human being is at the keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, which return a wealth of useful metadata and information.
Try this regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
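As a quick sanity check, the pattern accepts the sample address from the question (tested here in Python; Google Forms uses its own regex engine, but the pattern avoids anything exotic):
import re

pattern = r'\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}'
print(bool(re.fullmatch(pattern, '123 test st, test city, TT 12345')))  # True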
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

Find domain and remove other parts from URL

I have a list of domain names with parameters:
www.frontdir.com/index.php?adds1205
centurydirectory.com/submit/
www.directoryhigher.com/index.php?filec-linkapproval&x_response_code1
I need to find the parts that follow the domain and remove them.
Finally, my result should look as follows.
Expected result:
www.frontdir.com
centurydirectory.com
www.directoryhigher.com
I tried the following regex
/([^/\?]+)\?
but I was not able to select the part after the "?".
How can I attain this result?
How about replacing
\/.*$
with an empty string?
I'm assuming here that you have one URL per line (your example suggests as much) and that you want to keep just the domains (again, as per your example).
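To illustrate, here is a quick Python sketch of that substitution applied to the sample list (any tool with regex replace would behave the same way):
import re

urls = ['www.frontdir.com/index.php?adds1205',
        'centurydirectory.com/submit/',
        'www.directoryhigher.com/index.php?filec-linkapproval&x_response_code1']
for url in urls:
    # Drop everything from the first slash to the end of the line
    print(re.sub(r'/.*$', '', url))
# www.frontdir.com
# centurydirectory.com
# www.directoryhigher.com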

How to use environments for lookups

My question builds upon the topic of matching a string against multiple patterns. One solution discussed here is to use sapply(keywords, grepl, strings, ignore.case=TRUE) which yields a two-dimensional matrix.
However, I run into significant speed issues when applying this approach to 5K+ keywords and 60K+ strings (I cancelled the process after 12 hours).
One idea is to use hash tables, or environments in R. However, I don't understand how to translate/convert my strings into an environment while keeping the numerical index.
I have strings[1] ... strings[60000]:
e <- new.env(hash=TRUE)
for (i in 1:length(strings)) {
  assign(x=i, value=strings, envir=e)
}
As x in assign must be a character, I can't use it like this, but I hope you get the idea: I want to be able to index the environment with the same numbers as in my strings[...] vector.
Thanks for your help!
R environments are not used as much as Perl hashes are, I think, just because there are not widely understood 'idioms' for doing so. In your case the key question is: do you really want the numerical index? If so, it should be the value. The key is your string; that's the whole point of the exercise.
e <- new.env(hash=T)
strings <- as.character(chickwts$feed) # note! not unique
sapply(1:length(strings), function(i)assign(strings[i], i, e))
e$horsebean # returns 10
In this example only the last index associated with each string is kept, but you can assign anything that might be useful to each key, such as a vector of indices.
You can then look your data up in a number of ways. For example, you can search for keys matching a regex using ls, and retrieve the values using mget():
# find all keys containing 'beans'
ls(e, patt='bean')
# retrieve bean data
mget(ls(e, pat='bean'),e)

Regex - Extract number from a link

I have this link www.xxx.yy/yyy/zzzzzz/xyz-z-yzy-/93797038 and I want to take the number 93797038 in order to pass it into another link.
For example: I want afterwards something like www.m.xxx.yy/93797038 which is the same page as before but in its mobile version.
In general, I know that I have to write www.xxx.yy/(.*) to capture anything that follows the main URL, and then use the captured group as www.m.xxx.yy/%1, which redirects to the same page in its mobile version.
Any ideas how to do it?
EDIT: The link www.xxx.yy/yyy/zzzzzz/xyz-z-yzy-/93797038 is generated automatically. The only part that stays the same each time is www.xxx.yy. Every time the system runs it produces different URLs, and each time I want to take the number from those URLs, e.g. the 93797038 in my case.
\/(\d+?)$ will get the trailing digits after the final /.
Why do you want regex? You can use
string str = @"www.xxx.yy/yyy/zzzzzz/xyz-z-yzy-/93797038";
string digit = str.Split('/').Last();
instead.
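Either way is easy to try; here is a hypothetical Python equivalent showing the regex from the first answer next to the split-and-take-last idea:
import re

url = 'www.xxx.yy/yyy/zzzzzz/xyz-z-yzy-/93797038'
print(re.search(r'/(\d+?)$', url).group(1))  # 93797038
print(url.rsplit('/', 1)[-1])                # 93797038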

Fast string matching algorithm with simple wildcards support

I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.
Requirements for wildcard support are as follows:
Wildcard (*) can only substitute a "part" of a URL. That is fragments of a domain, path, and parameters. For example, "*.part.part/*/part?part=part&part=*". The only exception to this rule is in the path area where "/*" should match anything after the slash.
Examples:
*.site.com/* -- should match sub.site.com/home.html, sub2.site.com/path/home.html
sub.site.*/path/* -- should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html
Additional requirements:
Fast lookup (I realize "fast" is a relative term. Given the maximum of 250k rules, a lookup should still complete in under 1.5 s if possible.)
Work within the scope of a modern desktop (e.g. not a server implementation)
Ability to return 0:n matches given an input string
Matches will have rule data attached to them
What is the best system/algorithm for such as task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.
First of all, one of the worst-performing searches you can do is one with a wildcard at both ends of the string, e.g. "*.domain.com/path*", and I think you're going to hit this case a lot. So my first recommendation is to reverse the order of the domain labels as they're stored in your DB: com.domain.example/path1/path2/page.html. That will allow you to keep things much more tidy and only use wildcards in "one direction" on the string, which will provide MUCH faster lookups.
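To illustrate the reversal, here is a small sketch (my own Python, not from the answer) that rewrites a URL so its domain labels are stored most-significant-first; rules such as *.site.com/* would be reversed the same way, leaving the wildcard at the end of the domain:
def reverse_domain(url):
    # 'sub.site.com/path/home.html' -> 'com.site.sub/path/home.html'
    domain, sep, rest = url.partition('/')
    return '.'.join(reversed(domain.split('.'))) + sep + rest

print(reverse_domain('sub.site.com/path/home.html'))  # com.site.sub/path/home.html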
I think John mentions some good points about how to do this all within your DB. If that doesn't work I would use a regex library in C++ against the list. I bet you'll get the best performance and most general regex syntax that way.
If I'm not mistaken, you can take a string rule and break it up into domain, path, and query pieces, just as if it were a URL. Then you can apply a standard wildcard matching algorithm to each of those pieces against the corresponding pieces of the URLs you want to test. If all of the pieces match, the rule is a match.
Example
Rule: *.site.com/*
domain => *.site.com
path => /*
query => [empty]
URL: sub.site.com/path/home.html
domain => sub.site.com
path => /path/home.html
query => [empty]
Matching process:
domain => *.site.com matches sub.site.com? YES
path => /* matches /path/home.html? YES
query => [empty] matches [empty] YES
Result: MATCH
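Here is a minimal sketch of that per-piece matching in Python (using fnmatch as a stand-in for a standard wildcard matcher; the URL splitting is deliberately simplified and assumes no scheme prefix):
from fnmatch import fnmatch

def split_url(s):
    # Break 'domain/path?query' into its three pieces, as in the example above
    domain, _, rest = s.partition('/')
    path, _, query = ('/' + rest).partition('?') if rest else ('', '', '')
    return domain, path, query

def rule_matches(rule, url):
    return all(fnmatch(u, r) for r, u in zip(split_url(rule), split_url(url)))

print(rule_matches('*.site.com/*', 'sub.site.com/path/home.html'))  # True
print(rule_matches('sub.site.*/path/*', 'sub.site.com/home.html'))  # False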
As you are storing the rules in a database, I would store them already broken into those three pieces. And if you want uber speed, you could convert the *'s to %'s and then use the database's native LIKE operator to do the matching for you. Then you'd just have a query like:
SELECT *
FROM ruleTable
WHERE #urlDomain LIKE ruleDomain
AND #urlPath LIKE rulePath
AND #urlQuery LIKE ruleQuery
where #urlDomain, #urlPath, and #urlQuery are variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.
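Since the question mentions SQLite, here is the same idea sketched with Python's sqlite3 module just to show the shape of the prepared statement (the table and column names are the hypothetical ones from above, and the rule patterns are assumed to be stored with % wildcards already):
import sqlite3

conn = sqlite3.connect('rules.db')  # hypothetical database file
url_domain, url_path, url_query = 'sub.site.com', '/path/home.html', ''
rows = conn.execute(
    'SELECT * FROM ruleTable '
    'WHERE ? LIKE ruleDomain AND ? LIKE rulePath AND ? LIKE ruleQuery',
    (url_domain, url_path, url_query)).fetchall()
# rows now holds every rule whose patterns cover the URL, or is empty if none match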