List of 2LD.1LD (2nd-level domain . 1st/top-level domain) names?

I am looking for a list of these (it doesn't matter if it's not complete; it just needs to be big, as it's for generating dummy data), something like:
.net.nz
.co.nz
.edu.nz
.govt.nz
.com.au
.govt.au
.com
.net
any ideas where I can locate a list?

There are answers here already. Most of them relate to the use of http://publicsuffix.org/, and implementations that consume it were even given in some languages, like Ruby.

To get all the ICANN domains, this python code should work for you:
import requests
url = 'https://publicsuffix.org/list/public_suffix_list.dat'
page = requests.get(url)
icann_domains = []
for line in page.text.splitlines():
if 'END ICANN DOMAINS' in line:
break
elif line.startswith('//'):
continue
else:
domain = line.strip()
if domain:
icann_domains.append(domain)
print(len(icann_domains)) # 7334 as of Nov 2018
Remove the break statement to get the private domains as well.
Be careful: you will get some entries like *.kh (e.g. http://www.mptc.gov.kh/dns_registration.htm). The * is a wildcard.
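If you only want plain suffixes for your dummy data, here is a minimal sketch of how you might handle the wildcard (*.) and exception (!) rules that appear in the list, building on the snippet above (plain_suffixes is just an illustrative name):
plain_suffixes = []
for domain in icann_domains:
    if domain.startswith('*.'):
        plain_suffixes.append(domain[2:])  # keep the parent suffix: '*.kh' -> 'kh'
    elif domain.startswith('!'):
        continue  # exception rules mark non-suffixes, so skip them
    else:
        plain_suffixes.append(domain)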

Python3 - filter text before writing to a file

This is a function that is called by my main.py script.
Here is the issue: I have a file with a list of IP addresses, and I query an API to do reverse DNS lookups of those IPs against the (url) variable, then write out the "response.text".
In the response.text output,
I get lines like No DNS A records found for 96.x.x.x.
Other data I get is just a DNS name: 'subdomain.domain.com'.
How do I filter my results to NOT include every 'No DNS A records found for (whatever IP shows)' line?
#!/usr/bin/env python3
import requests
import pdb

# function to use hackertarget api for reverse dns lookup
def dns_finder(file_of_ips="endpointslist"):
    print('\n\n########## Finding DNS Names ##########\n')
    targets = open("TARGETS", "w")
    with open(file_of_ips) as f:
        for line in f:
            line.strip()
            url = 'https://api.hackertarget.com/reverseiplookup/?q=' + line
            response = requests.get(url)
            #pdb.set_trace()
            hits = response.text
            print(hits)
            targets.write(hits)
            return hits
You can use the re module to check the response to make sure it is a valid DNS hostname (using a regular expression pattern):
import re

# ... rest of your code

# before writing to targets file
if re.match(r'^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$', hits) is not None:
    targets.write(hits)
The re.match() function will return a match object if the given string matches the regex. Otherwise, it will return None. So if you check to make sure it is not None, then it should have matched the regex (i.e. it matches the pattern for a DNS hostname). You could alter the regex or abstract the check to a more generic isValidHit(hit) and put any functionality you want there to check for more than just DNS hostnames.
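For example, here is a sketch of that abstraction, compiling the same hostname regex once (isValidHit is the hypothetical helper name suggested above):
import re

HOSTNAME_RE = re.compile(
    r'^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*'
    r'([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$')

def isValidHit(hit):
    # Currently just the hostname check; add any further rules here.
    return HOSTNAME_RE.match(hit.strip()) is not None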
Note
You want to be careful with your line targets = open("TARGETS","w"). That opens the file for writing, but you need to manually close the file elsewhere and handle any errors involved with IO to it.
You should try to use with open(filename) as blah: whenever dealing with file IO. It is much safer and easier to maintain.
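For example, here is a minimal sketch of the same function with both files managed by with blocks (also note that str.strip() returns a new string, so its result must be assigned):
#!/usr/bin/env python3
import requests

def dns_finder(file_of_ips="endpointslist"):
    print('\n\n########## Finding DNS Names ##########\n')
    # Both files are closed automatically, even if an exception is raised.
    with open(file_of_ips) as f, open("TARGETS", "w") as targets:
        for line in f:
            ip = line.strip()  # strip() returns a new string; assign it
            url = 'https://api.hackertarget.com/reverseiplookup/?q=' + ip
            response = requests.get(url)
            hits = response.text
            print(hits)
            targets.write(hits)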
Edit
Simple regex for an IPv4 address found here.
Added more explanation of the re.match() function.
Edit 2
I just re-read your post and saw that you are actually not looking for IPv4 addresses, but rather DNS hostnames...
In that case you could simply switch the regex pattern to ^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$. Source for the DNS Hostname regex.
@Tim Klein's answer is great if you want to keep only the domain names. However, as stated in the question, if you just want to remove error messages like "No DNS A records found for ...", then you can use the str.startswith method:
#!/usr/bin/env python3
import requests
import pdb

# function to use hackertarget api for reverse dns lookup
def dns_finder(file_of_ips="endpointslist"):
    print('\n\n########## Finding DNS Names ##########\n')
    targets = open("TARGETS", "w")
    with open(file_of_ips) as f:
        for line in f:
            line.strip()
            url = 'https://api.hackertarget.com/reverseiplookup/?q=' + line
            response = requests.get(url)
            #pdb.set_trace()
            hits = response.text
            print(hits)
            if not hits.startswith("No DNS A records found for"):
                targets.write(hits)
            return hits  # why iterate if you return here?

Is there any maximum length of character regex can handle?

I'm stuck making my regex work in Python 3.5.
I have a list which contains a lot of URLs.
Some URLs are short, others are long.
I could extract the URLs I wanted... mostly. But this one URL cannot be extracted:
http://www.forbes.com/sites/julianmitchell/2016/09/27/this-startup-uses-drones-to-map-and-manage-massive-construction-projects/#1ca4d634334e
Here is the code.
urlList = []  # Assume there are many URLs in this list.
interdrone = re.compile(r"http://www.interdrone.com/news/(?:.*)")
hp = re.compile(r"http://www.interdrone.com/$")
restOfThem = re.compile(r'\#|youtube|bzmedia|facebook|twitter|mailto|geoconnexion.com|linkedin|gplus|resources\.sdtimes\.com|precisionagvision')
cleanuplist = []  # Adding URLs I need to this new list.
for i in range(0, len(urlList)):
    if restOfThem.findall(urlList[i]):
        continue
    elif hp.findall(urlList[i]):
        continue
    elif interdrone.findall(urlList[i]):
        cleanuplist.append(urlList[i])
    else:
        cleanuplist.append(urlList[i])
logmsg("Generated Interdrone clean URL list")
return (cleanuplist)
The forbes.com URL should fall into the "else:" clause, so it should be added to cleanuplist. However, it is not. Again, only this one is not added to the new list.
I tried to pick out the Forbes site specifically with this,
forbes = re.compile(r"http://www.forbes.com/(?:.*)")
then added the following elif statement:
elif forbes.findall(urlList[i]):
    cleanuplist.append(urlList[i])
However, it also does not pick up the Forbes site.
Therefore, I have come to suspect there is some kind of maximum character limit on applying a regex (so that findall is skipped?). I could be wrong. How can I extract the forbes.com URL above?
Your restOfThem regex matches the URL you provided, specifically the # that's present in the last part of your URL, so the first branch hits continue and the URL is skipped. There is no "character limit" (unless Python runs out of memory).
You need to be more restrictive with the regex. For example, what if your URL had been http://www.forbes.com/sites/julianmitchell/2016/09/27/twitter-stock-down - should it have matched the twitter part of your regex?
Also, you probably want to use re.search(), not re.findall().
Furthermore, you don't seem to need the last elif clause since the same thing will happen whether it's true or not.
Lastly, the correct way to iterate would be for url in urlList: instead of using indexes. This is Python, not Java.
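Putting those points together, here is a sketch of the cleaned-up loop (using the same compiled patterns as in your code):
cleanuplist = []
for url in urlList:
    # Note: the '\#' alternative in restOfThem matches any URL containing '#',
    # including the Forbes link; tighten that pattern if you want to keep such URLs.
    if restOfThem.search(url) or hp.search(url):
        continue
    cleanuplist.append(url)  # covers both the interdrone branch and the else branch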

How do I sort Flask/Werkzeug routes in the order in which they will be matched?

I have a function that I'm using to list routes for my Flask site and would like to be sure it is sorting them in the order in which they will be matched by Flask/Werkzeug.
Currently I have
def routes(verbose, wide, nostatic):
    """List routes supported by the application"""
    for rule in sorted(app.url_map.iter_rules()):
        if nostatic and rule.endpoint == 'static':
            continue
        if verbose:
            fmt = "{:45s} {:30s} {:30s}" if wide else "{:35s} {:25s} {:25s}"
            line = fmt.format(rule, rule.endpoint, ','.join(rule.methods))
        else:
            fmt = "{:45s}" if wide else "{:35s}"
            line = fmt.format(rule)
        print(line)
but this just seems to sort the routes in the order I've defined them in my code.
How do I sort Flask/Werkzeug routes in the order in which they will be matched?
Another approach: specify a host and a path, and see which rule applies.
from werkzeug.exceptions import NotFound

# 'host' is the server name to bind to; 'match' is the path to test.
urls = app.url_map.bind(host)
try:
    m = urls.match(match, "GET")
    z = '{}({})'.format(m[0], ','.join(["{}='{}'".format(arg, m[1][arg]) for arg in m[1]] +
                                       ["host='{}'".format(host)]))
    return z
except NotFound:
    return
On each request, Flask/Werkzeug reorders the rules via self.map.update() if needed, using route weights (see my other answer, URL routing conflicts for static files in Flask dev server).
You can see that the Map.update method updates _rules (see https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/routing.py#L1313).
So you can do the same thing:
sorted(current_app.url_map._rules, key=lambda x: x.match_compare_key())
Or, if all routes are already loaded and any request has been handled (i.e. Map.build or Map.match has been called), just use app.url_map._rules; it will already be sorted. During a request, app.url_map._rules is already sorted, because Map.match was called before dispatching.
You can also see that Map.update is called in Map.iter_rules, so you can just use current_app.url_map.iter_rules() (see https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/routing.py#L1161).
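So a listing function that shows rules in match order could look like this (a sketch; match_compare_key() is the internal ordering key mentioned above):
def routes_in_match_order(app):
    # Werkzeug sorts _rules by this key in Map.update, so sorting by it
    # reproduces the order in which rules are matched.
    rules = sorted(app.url_map._rules, key=lambda r: r.match_compare_key())
    for rule in rules:
        print("{:45s} {:30s}".format(str(rule), rule.endpoint))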

how to read only URL from txt file in MATLAB

I have a text file containing multiple URLs along with other information about them. How can I read the txt file and save only the URLs in an array, so I can download them? I want to use
C = textscan(fileId, formatspec);
What should I put in formatspec to match a URL?
This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};

% find URLs
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
     '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as made here for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical than practical: those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, it was about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;

str = {...
    'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
    'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
    'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};

% Attempt to convert each item into an URL.
for ii = 1:numel(str)
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})
        catch ME
            % rethrow any non-url related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                throw(ME);
            end
        end
    end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want this longer, inherently slower and far uglier solution :)
As I suspected, you could use java.net.URL, as shown in this answer.
To implement the same code in Matlab:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
    try
        url = java.net.URL(spl_str{k})
        % Store or save the URL contents here
    catch e
        % it's not a URL.
    end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.

Codeigniter route regex - match any string except 'admin'

I'd like to send any route that doesn't match an admin route to my "event" controller. This seems to be a fairly common requirement, and a cursory search throws up all sorts of similar questions.
The solution, as I understand it, seems to be using a negative lookahead in the regex. So my attempt looks like this:
$route['(?!admin).*'] = "event";
...which works. Well, sort of. It does send any non-admin request to my "event" controller, but I need it to pass the actual string that was matched: so /my-new-event/ is routed to /event/my-new-event/.
I tried:
$route['(?!admin).*'] = "event/$0";
$route['(?!admin).*'] = "event/$1";
$route['(?!admin)(.*)'] = "event/$0";
$route['(?!admin)(.*)'] = "event/$1";
... and a few other increasingly random and desperate permutations. All result in a 404 page.
What's the correct syntax for passing the matched string to the controller?
Thanks :)
I don't think you can do "negative routing".
But routes do have an order: "routes will run in the order they are defined. Higher routes will always take precedence over lower ones." So I would define the admin route first, then catch everything else.
Assuming your admin paths look like "/admin/...", I would suggest:
$route['admin/(:any)'] = "admincontroller/$1";
$route['(:any)'] = "event/$1";