I'm stuck getting my regex to work in Python 3.5.
I have a list which contains a lot of URLs.
Some URLs are short, others are long.
I could extract the URLs I wanted... mostly, but this one URL cannot be extracted.
http://www.forbes.com/sites/julianmitchell/2016/09/27/this-startup-uses-drones-to-map-and-manage-massive-construction-projects/#1ca4d634334e
Here is the code.
urlList = []  # Assume there are many URLs in this list.
interdrone = re.compile(r"http://www.interdrone.com/news/(?:.*)")
hp = re.compile(r"http://www.interdrone.com/$")
restOfThem = re.compile(r'\#|youtube|bzmedia|facebook|twitter|mailto|geoconnexion.com|linkedin|gplus|resources\.sdtimes\.com|precisionagvision')
cleanuplist = []  # Adding URLs I need to this new list.
for i in range(0, len(urlList)):
    if restOfThem.findall(urlList[i]):
        continue
    elif hp.findall(urlList[i]):
        continue
    elif interdrone.findall(urlList[i]):
        cleanuplist.append(urlList[i])
    else:
        cleanuplist.append(urlList[i])
logmsg("Generated Interdrone clean URL list")
return (cleanuplist)
The forbes.com URL should fall into the else: clause, so it should be added to cleanuplist. However, it is not. Again, only this one URL is not added to the new list.
I tried to pick out the Forbes site specifically with this:
forbes = re.compile(r"http://www.forbes.com/(?:.*)")
then added the following elif statement:
elif forbes.findall(urlList[i]):
    cleanuplist.append(urlList[i])
However, it still does not pick up the Forbes site.
Therefore, I have come to suspect there is some kind of maximum character limit for applying a regex (so that findall is skipped?). I could be wrong. How can I extract the forbes.com URL above?
Your restOfThem regex matches the URL you provided: the \# alternative matches the # in the last part of your URL. That's why it is skipped. There is no "character limit" (unless Python runs out of memory).
You need to be more restrictive with the regex. For example, what if your URL had been http://www.forbes.com/sites/julianmitchell/2016/09/27/twitter-stock-down - should it have matched the twitter part of your regex?
Also, you probably want to use re.search(), not re.findall().
Furthermore, you don't seem to need the last elif clause since the same thing will happen whether it's true or not.
Lastly, the correct way to iterate would be for url in urlList: instead of using indexes. This is Python, not Java.
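Putting that advice together, a minimal sketch of the tightened loop might look like this (assuming urlList is already populated; the narrowed restOfThem pattern, which simply drops the bare \# alternative, is an illustration rather than the only possible fix):

import re

# Drop the bare '\#' alternative so a trailing #fragment no longer
# disqualifies a URL; keep the rest of the blocklist as-is.
restOfThem = re.compile(
    r'youtube|bzmedia|facebook|twitter|mailto|geoconnexion\.com'
    r'|linkedin|gplus|resources\.sdtimes\.com|precisionagvision')
hp = re.compile(r'http://www\.interdrone\.com/$')

cleanuplist = []
for url in urlList:  # iterate over the URLs directly, no indexes
    if restOfThem.search(url) or hp.search(url):
        continue  # blocked word or bare homepage: skip
    cleanuplist.append(url)  # interdrone news URLs and everything else

With this version, the forbes.com URL is kept because nothing in the narrowed pattern matches it.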
I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end

    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =

      # Store the name of the set, ie "M13"
      @card_set =

      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"

      # Store a specific JSON property
      @card_art =

      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt after Googling around was to use a regex to split @text at the |, like so:
#card_name = "#{#text}".split(/| */)
This didn't quite work, instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work; you just need to give it the correct regex (and you can call it on @text directly). You also need to escape the pipe character in the regex, because pipes can have special meaning. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that the parts still contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
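One hedged way to cover those cases (my own sketch, not part of the original code) is to validate before assigning:

# Fail loudly on malformed input instead of silently mis-parsing it.
parts = @text.split(/\|/)
unless parts.size == 2
  raise ArgumentError, "card tag expects 'Card Name | Set', got: #{@text.inspect}"
end
@card_name, @card_set = parts.map(&:strip)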
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(#card_name)}&set=#{CGI.escape(#card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
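That said, here is a rough sketch of those two steps, just to show the shape of the code (the image_uris field name comes from Scryfall's documented card JSON; verify it against a real response):

require 'net/http'
require 'json'

response = Net::HTTP.get(URI(query))    # fetch the card JSON as a string
card = JSON.parse(response)             # parse it into a Hash
@card_art = card['image_uris']['large'] # pull out one nested property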
I am trying to pass the first part of a Django URL to a view, so I can filter my results by the term in the URL.
Looking at the documentation, it seems quite straightforward.
However, I have the following urls.py
url('<colcat>/collection/(?P<name>[\w\-]+)$', views.collection_detail, name='collection_detail'),
url('<colcat>/', views.collection_view, name='collection_view'),
In this case, I want to be able to go to /living and have living be passed to my view so that I can use it to filter by.
When trying this, however, no matter what URL I use, it isn't matched, and I get an error saying the address could not be matched to any URLs.
What am I missing?
<colcat> is not a valid regex. You need to use the same format as you have for name.
url('(?P<colcat>[\w\-]+)/collection/(?P<name>[\w\-]+)$', views.collection_detail, name='collection_detail'),
url('(?P<colcat>[\w\-]+)/$', views.collection_view, name='collection_view'),
Alternatively, use the newer path form, which is much simpler:
path('<str:colcat>/collection/<str:name>', views.collection_detail, name='collection_detail'),
path('<str:colcat>/', views.collection_view, name='collection_view'),
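Either way, the captured value arrives as a keyword argument on the view. A minimal sketch (Collection is a hypothetical model name, not from the question):

# views.py
from django.shortcuts import render
from .models import Collection  # hypothetical model

def collection_view(request, colcat):
    # colcat is e.g. 'living' for the URL /living/
    items = Collection.objects.filter(category=colcat)
    return render(request, 'collection.html', {'items': items})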
Is there a way to check if part of the URL contains a certain string:
Eg. <% if current_spree_page?("/products/*") %>, where * could be anything?
I tested, and gmacdougall's answer works; I had already found a solution, though.
This is what I used to render different layouts depending on what the url is:
url = request.path_info
if url.include?('products')
  render :layout => 'product_layout'
else
  render :layout => 'layout'
end
The important thing to note is that different pages call different methods within the controller (e.g. show, index). What I did was put this code in its own method and then call that method where needed.
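For instance, a hedged sketch of that extraction (the method name is my own):

# In the controller; call this from each action that needs it.
def layout_for_request
  request.path_info.include?('products') ? 'product_layout' : 'layout'
end

# e.g. in show or index:
render :layout => layout_for_request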
If you are at a place where you have access to the ActionDispatch::Request you can do the following:
request.path.start_with?('/products')
You can use the include? method:
my_string = "abcdefg"
if my_string.include? "cde"
puts "String includes 'cde'"
end`
Remember that include? is case sensitive. So if my_string in the example above were something like "abcDefg" (with an uppercase D), include?("cde") would return false. You may want to call downcase before calling include?.
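For example:

"abcDefg".downcase.include?("cde")  # => true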
The other answers give the cleanest ways of checking your URL. I want to share a way of doing this with a regular expression, so you can check your URL for a string at a particular location in the URL.
This method is useful when you have locales as first part of your URL like /en/users.
module MenuHelper
  def is_in_begin_path(*url_parts)
    url_parts.each do |url_part|
      return true if request.path.match(/^\/\w{2}\/#{url_part}/).present?
    end
    false
  end
end
This helper method checks the part after the second slash, provided the first path segment consists of two word characters, as is the case when you use locales. Drop it into your ApplicationController to have it available everywhere.
Example:
is_in_begin_path('users', 'profile')
That matches /en/users/4, /en/profile, /nl/users/9/statistics, /nl/profile etc.
I have a text file containing multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array, so that I can download them? I want to use
C = textscan(fileId, formatspec);
What should I put in formatspec to match the URL format?
This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% find URLs
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
     '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as made here, for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical than practical: those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, the regex I gave you pretty much covers them.
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" into the code (in my timings, about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% Attempt to convert each item into an URL.
for ii = 1:numel(str)
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})
        catch ME
            % rethrow any non-URL-related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                throw(ME);
            end
        end
    end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently it is also unable to find URLs without a leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to use this longer, inherently slower, and far uglier solution :)
As I suspected, you could use java.net.URL, according to this answer.
To implement the same approach in MATLAB:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
    try
        url = java.net.URL(spl_str{k})
        % Store or save the URL contents here
    catch e
        % it's not a URL.
    end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.
I am looking for a list of domain suffixes (it doesn't matter if it's not complete; it just needs to be big, as it's for generating dummy data).
I'm looking for a list like:
.net.nz
.co.nz
.edu.nz
.govt.nz
.com.au
.govt.au
.com
.net
Any ideas where I can locate such a list?
There are answers here. Most of them relate to the use of http://publicsuffix.org/, and implementations using it were given in several languages, like Ruby.
To get all the ICANN domains, this Python code should work for you:
import requests

url = 'https://publicsuffix.org/list/public_suffix_list.dat'
page = requests.get(url)

icann_domains = []
for line in page.text.splitlines():
    if 'END ICANN DOMAINS' in line:
        break
    elif line.startswith('//'):
        continue
    else:
        domain = line.strip()
        if domain:
            icann_domains.append(domain)

print(len(icann_domains))  # 7334 as of Nov 2018
Remove the break statement to get private domains as well.
Be careful, as you will get some domains like this: *.kh (e.g. http://www.mptc.gov.kh/dns_registration.htm). The * is a wildcard.
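If the wildcard and exception rules get in the way of generating dummy data, a small post-processing step (my own sketch, not part of the original answer) can filter them out:

# Keep only literal suffixes; drop '*.' wildcard rules and '!' exception
# rules defined by the public suffix list format.
plain_suffixes = [d for d in icann_domains if not d.startswith(('*', '!'))]
dummy_endings = ['.' + d for d in plain_suffixes]  # e.g. '.co.nz', '.com'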