how to read only URL from txt file in MATLAB - regex

I have a text file containing multiple URLs along with other information about each URL. How can I read the txt file and save only the URLs in an array so I can download them? I want to use
C = textscan(fileId, formatspec);
What should I put in formatspec to match a URL?

This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% find URLs
C = regexpi(str, ...
['((http|https|ftp|file)://|www\.|ftp\.)',...
'[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as made here, for example, which is where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical than practical -- those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings it was about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% Attempt to convert each item into an URL.
for ii = 1:numel(str)
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})
        catch ME
            % rethrow any non-URL-related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                rethrow(ME);
            end
        end
    end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want this longer, inherently slower and far uglier solution :)

As I suspected, you could use java.net.URL, according to this answer.
To implement the same code in Matlab:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
    try
        url = java.net.URL(spl_str{k})
        % Store or save the URL contents here
    catch e
        % it's not a URL.
    end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.

Related

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end

    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =

      # Store the name of the set, ie "M13"
      @card_set =

      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"

      # Store a specific JSON property
      @card_art =

      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt after Googling around was to use a regex to split the @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work, instead it output this:
["A", "r", "b", "o", "r", " ", "E", "l", "f", " ", "|", " ", "M", "1", "3", " "]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work – you just need to give it the correct regex (and you can call that on @text directly). You also need to escape the pipe character in the regex, because pipes can have special meaning. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.

Chunk a colon in NLTK

I am trying to split a chunk at the position of a colon (:) in NLTK, but it seems to be a special case. In a normal regex I can just put it in [:] with no problems.
But in NLTK, no matter what I do, the RegexpParser does not like it.
from nltk import RegexpParser
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>|<NNP.*><\:><VBD>} # chunk (Rapunzel + : + let) together
{<NNP>+}
<.*>}{<VBD.*>
"""
cp = RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), (":",":"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))
The above code does make a chunk, picking up the colon as part of the block.
The <.*>}{<VBD.*> line splits the chunk made up of (Rapunzel + : + let) at the position before let.
If you take out that split and replace it with one based on the colon, it gives an error:
from nltk import RegexpParser
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>|<NNP.*><\:><VBD>} # chunk (Rapunzel + : + let) together
{<NNP>+}
<.*>}{<\:.*>
"""
cp = RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), (":",":"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))
ValueError: Illegal chunk pattern: >
Can anyone explain how to do this? I tried Google and going through the docs, but I am none the wiser. I can deal with this post-chunking no problem, but I just have to know why, or how. :-)
It seems that NLTK treats the second colon in each chunk definition as an indicator to start a new chunk.
For those who get the same error, a workaround is to break a multi-rule definition down into multiple single-rule definitions with the same name.
Let's assume we have the following grammar:
grammar = r"""
SOME_CHUNK:
{<NN><:>}
{<JJ><:>}
"""
To fix this, change it to:
grammar = r"""
SOME_CHUNK: {<NN><:>}
SOME_CHUNK: {<JJ><:>}
"""
Unfortunately, this won't work if one is using a chinking regex with another colon, as in your example.
To help you solve your specific issue, please post the exact sentence you are trying to parse. From your example it is hard to tell why you need the |<NNP.*><\:><VBD> part at all.

Is there any maximum length of character regex can handle?

I'm stuck making my regex work in Python 3.5.
I have a list which contains a lot of URLs.
Some URLs are short, others are long.
I could extract the URLs I wanted... mostly, but this one URL alone cannot be extracted.
http://www.forbes.com/sites/julianmitchell/2016/09/27/this-startup-uses-drones-to-map-and-manage-massive-construction-projects/#1ca4d634334e
Here is the code.
urlList = []  # Assume there are many URLs in this list.
interdrone = re.compile(r"http://www.interdrone.com/news/(?:.*)")
hp = re.compile(r"http://www.interdrone.com/$")
restOfThem = re.compile(r'\#|youtube|bzmedia|facebook|twitter|mailto|geoconnexion.com|linkedin|gplus|resources\.sdtimes\.com|precisionagvision')
cleanuplist = []  # Adding URLs I need to this new list.
for i in range(0, len(urlList)):
    if restOfThem.findall(urlList[i]):
        continue
    elif hp.findall(urlList[i]):
        continue
    elif interdrone.findall(urlList[i]):
        cleanuplist.append(urlList[i])
    else:
        cleanuplist.append(urlList[i])
logmsg("Generated Interdrone clean URL list")
return (cleanuplist)
The forbes.com URL should fall into the "else:" clause, so it should be added to cleanuplist. However, it is not. Again, only this one URL is not added to the new list.
I tried to match the Forbes site specifically with this,
forbes = re.compile(r"http://www.forbes.com/(?:.*)")
then added the following elif statement:
elif forbes.findall(urlList[i]):
    cleanuplist.append(urlList[i])
However, it also does not pick up the Forbes site.
Therefore, I have come to suspect there is some kind of maximum character limit for applying a regex (so that findall is skipped?).
I could be wrong. How can I extract the forbes.com URL above?
Your restOfThem regex matches the URL you provided: its \# alternative matches the # in the last part of your URL. That's why it is skipped. There is no "character limit" (unless Python runs out of memory).
You need to be more restrictive with the regex. For example, what if your URL had been http://www.forbes.com/sites/julianmitchell/2016/09/27/twitter-stock-down - should it have matched the twitter part of your regex?
Also, you probably want to use re.search(), not re.findall().
Furthermore, you don't seem to need the last elif clause since the same thing will happen whether it's true or not.
Lastly, the correct way to iterate would be for url in urlList: instead of using indexes. This is Python, not Java.
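For illustration, here is a minimal sketch of that advice, with a small sample list standing in for the real urlList. Dropping the overly broad \# alternative is one way to be more restrictive, since the # fragment is all that was "wrong" with the Forbes URL:
import re

# Sample input standing in for the question's urlList.
urlList = [
    "http://www.interdrone.com/",
    "http://www.interdrone.com/news/some-article",
    "http://www.forbes.com/sites/julianmitchell/2016/09/27/this-startup-uses-drones-to-map-and-manage-massive-construction-projects/#1ca4d634334e",
    "https://twitter.com/interdrone",
]

# Without the '\#' alternative, a URL is no longer rejected just for
# containing a fragment like '#1ca4d634334e'.
restOfThem = re.compile(
    r"youtube|bzmedia|facebook|twitter|mailto|geoconnexion\.com|"
    r"linkedin|gplus|resources\.sdtimes\.com|precisionagvision"
)
hp = re.compile(r"http://www\.interdrone\.com/$")

cleanuplist = [url for url in urlList
               if not (restOfThem.search(url) or hp.search(url))]
print(cleanuplist)  # keeps the news article and the Forbes URL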

how to fix bad decoding?

I am using the procedure shown below to write a list of person data (spidered from the web).
I think the code itself is OK, but I am confused by the results. Some chars are decoded properly, some not. For example:
STANIS?AWurodzony/a 01.01.1888, ?Ԅ?
HALINAurodzony/a 05.07.1927, ŁÓDŹ
the last word in both strings is the same!
Also, the ? sign used to replace untranslatable chars is sometimes used and sometimes not:
STANISŁAWurodzony/a 24.03.1907, RAKSZANY
^
| here is written ok - not replaced
And here is the code:
import codecs
import requests
from bs4 import BeautifulSoup

def findPerson():
    file = codecs.open('Names.txt', 'a', 'ISO-8859-1', 'replace')
    try:
        with codecs.open('./listNames_links.txt', 'r', 'ISO-8859-1', 'replace') as f:
            line = f.readline()
            while line != '':
                #print line,
                line = f.readline()
                res = requests.get('http://real.address.gov.pl' + line)
                res.raise_for_status()
                soup = BeautifulSoup(res.text)
                linkElems = soup.find('a', 'css_class_name').text
                file.write(linkElems)
                file.write('\r\n')  # preserve end-of-line
Question:
How do I fix this? Is my procedure wrong, or does the source page have broken encoding? (I suppose it is OK; I can read it in a browser without any errors.)
I'm not very familiar with Python, but I've had similar problems with Java programs. In almost every case the problem was that the same encoding was not used in all steps, so converting between them produced ugly characters.
I'd suggest using UTF-8 during the whole process if possible.
An off-topic side remark: because I stumbled upon this so very often, I bought myself this T-shirt (correctly spelled Scheiß, German for crap) and wear it at work ;-)
This looks like Polish, and it doesn't look like a multi-byte encoding such as UTF-8 is being used. You are writing to your Names.txt file using ISO-8859-1, which doesn't support all Polish characters.
You could use ISO-8859-2 for Polish character support, but even better would be to use UTF-8, which supports all languages and is the common standard on the web.
Try
file = codecs.open('Names.txt','a','UTF-8','replace')
When you make a request using Requests, try checking the encoding of each response, and forcing it if needed. For example:
res = requests.get('http://real.address.gov.pl'+line)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text)
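Putting it all together, here is a minimal sketch of the procedure using UTF-8 end to end; the host and the 'css_class_name' selector are placeholders carried over from the question:
import codecs
import requests
from bs4 import BeautifulSoup

def findPerson():
    # UTF-8 on both ends: reading the link list and writing the results.
    with codecs.open('Names.txt', 'a', 'utf-8', 'replace') as out, \
         codecs.open('./listNames_links.txt', 'r', 'utf-8', 'replace') as f:
        for line in f:
            res = requests.get('http://real.address.gov.pl' + line.strip())
            res.raise_for_status()
            res.encoding = 'utf-8'  # force UTF-8 before reading res.text
            soup = BeautifulSoup(res.text, 'html.parser')
            out.write(soup.find('a', 'css_class_name').text)
            out.write('\r\n')  # preserve end-of-line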

parsing URL in newspaper website

I have many URLs from the same newspaper; each URL is a repository for one writer.
For example:
http://alhayat.com/Opinion/Zinab-Ghasab.aspx
http://alhayat.com/Opinion/Abeer-AlFozan.aspx
http://www.alhayat.com/Opinion/Suzan-Mash-hadi.aspx
http://www.alhayat.com/Opinion/Thuraya-Al-Shahri.aspx
http://www.alhayat.com/Opinion/Badria-Al-Besher.aspx
Could someone please help me write a regular expression that would match all the writers' URLs?
Thanks!
In order to get Zinab-Ghasab.aspx, you need no regex.
Just iterate through all of these URLs and use
print s[s.rfind("/")+1:]
See sample demo.
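For example, applied to one of the URLs from the question:
# rfind locates the last '/', so the slice keeps everything after it.
s = "http://alhayat.com/Opinion/Zinab-Ghasab.aspx"
print(s[s.rfind("/") + 1:])  # Zinab-Ghasab.aspx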
A regex would look like
print re.findall(r"/([^/]+)\.aspx", input)
It will get all your values from input without .aspx extension.
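For instance, run against a few of the URLs listed in the question:
import re

urls = """http://alhayat.com/Opinion/Zinab-Ghasab.aspx
http://www.alhayat.com/Opinion/Suzan-Mash-hadi.aspx
http://www.alhayat.com/Opinion/Thuraya-Al-Shahri.aspx"""
print(re.findall(r"/([^/]+)\.aspx", urls))
# ['Zinab-Ghasab', 'Suzan-Mash-hadi', 'Thuraya-Al-Shahri']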
You can use the findall() method from the "re" module.
Assuming that you are reading the content from a file:
import re
fp = open("file_name", "r")
contents = fp.read()
writer_urls = re.findall(r"https?://.+\.com/.+/(.*)\.aspx", contents)
fp.close()
Now the writer_urls list holds all the required URLs.