Missing characters using Text.Regex.PCRE to parse web page title - regex

I recently made a website that needs to retrieve talk titles from the TED website.
So far, the problem is specific to this talk: Francis Collins: We need better drugs -- now
From the web page source, I get:
<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>
Now, in ghci, I tried this:
λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]
Either way, the parsed title is missing some characters on the left and has some unintended characters on the right. It seems to have something to do with the -- in the talk title. However,
λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Luckily, this is not a problem with Text.Regex.Posix.
λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]

My recommendation would be: don't use a regex for parsing HTML. Use a proper HTML parser instead. Here's an example using the html-conduit parser together with the xml-conduit cursor library (and http-conduit for download).
{-# LANGUAGE OverloadedStrings #-}
import Data.Monoid (mconcat)
import Network.HTTP.Conduit (simpleHttp)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (attributeIs, content, element,
                        fromDocument, ($//), (&//), (>=>))

main = do
    lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    let doc = parseLBS lbs
        cursor = fromDocument doc
    print $ mconcat $ cursor $// element "title" &// content
    print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content
The code is also available as active code on the School of Haskell.

Related

Spacy to Conll format without using Spacy's sentence splitter

This post shows how to get the dependencies of a block of text in CoNLL format with spaCy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,           # there's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,       # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_        # relation
        ))
It outputs this block:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence splitter. I would like to use it, and then give spaCy one sentence at a time to get POS, NER, and dependencies.
How can I get the POS, NER, and dependencies of one sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your split sentences, and then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # each doc is a single sentence here, so offset from the doc's first token
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,           # there's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,       # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_        # relation
        ))
Of course, following the CoNLL format you would have to print a newline after each sentence.
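For instance, a minimal driver over pre-split sentences might look like this (my_sentence_splitter and raw_text are placeholders for your own splitter and input, not spaCy functions):
# Hypothetical driver: my_sentence_splitter stands in for your own sentence splitter.
sentences = my_sentence_splitter(raw_text)
for sentence in sentences:
    printConll(sentence)
    print("")  # blank line between sentences, as the CoNLL format expects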
This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (as on that post) is to add the flexibility to plug in one's own sentence boundary detection rules. This problem is solved in conjunction with dependency parsing in spaCy, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.
@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. There is, however, a simple sentencizer:
https://spacy.io/api/sentencizer
It seems the sentencizer just uses punctuation (not the perfect approach). But if such a sentencizer exists, then you can create a custom one using your own rules, and it will affect the sentence boundaries for sure.
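As a sketch of that idea (assuming spaCy 2.x, where a custom pipeline component added before the parser may set token.is_sent_start), you could inject your own boundary rules like this; the "..." trigger below is only a stand-in for whatever your splitter decides:
import spacy

def set_custom_boundaries(doc):
    # Stand-in rule: start a new sentence after an ellipsis token.
    # Replace this with your own sentence-boundary logic.
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load('en')
nlp.add_pipe(set_custom_boundaries, before='parser')  # must run before the dependency parser
doc = nlp(u'Bob bought the pizza to Alice... she thanked him')
for sent in doc.sents:
    print([t.text for t in sent])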

Lemmatizing Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.
I prefer lemmatization over stemming because I can extract the word's meaning from its context in the sentence (e.g. distinguish between a verb and a noun) and obtain words that exist in the language, rather than roots of those words that usually don't have a meaning.
I found a library called pattern (pip2 install pattern) that is supposed to complement nltk in order to perform lemmatization of Italian. However, I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of a sentence.
Probably I should let pattern tokenize the sentence (and so also annotate each word with metadata about whether it is a verb/noun/adjective etc.), then retrieve the lemmatized word, but I am not able to do this, and I am not even sure it is possible at the moment?
Also: in Italian some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I am not able to find a way to split these 2 words with a combination of nltk and pattern, so I am not able to count the frequency of the words in the correct way.
import nltk
import string
import pattern.it  # import the Italian submodule so that pattern.it.parse is available

# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# from which the word comes, i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."

# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))

# 2nd remove punctuation and lower-case everything
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))

# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))

# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))

# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))

# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)
Gives this output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
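Since the end goal is frequency counting, once the lemmas are in a list, collections.Counter covers that step; a minimal sketch over the lemmatized list produced above:
from collections import Counter

# Count how often each lemma occurs in the lemmatized token list.
lemma_frequencies = Counter(word_tokenize_list_no_punct_lc_no_stowords_lemmatized)
for lemma, count in lemma_frequencies.most_common():
    print("{}\t{}".format(lemma, count))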
To sum up my questions:
1) How can I effectively lemmatize sentences with pattern using its own tokenizer (assuming lemmas are recognized as nouns/verbs/adjectives etc.)?
2) Is there a Python alternative to pattern for Italian lemmatization with nltk?
3) How can I split articles that are bound to the next word by an apostrophe?
I'll try to answer your questions, knowing that I don't know a lot about Italian!
1) As far as I know, the main responsibility for removing the apostrophe lies with the tokenizer, and as such the nltk Italian tokenizer seems to have failed here.
3) A simple thing you can do about it is to split on the apostrophe (although you will probably have to use the re package for more complicated patterns), for example:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
It yields:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
2) An alternative to pattern would be treetagger; granted, it is not the easiest install of all (you need the Python package and the tool itself), but after that part it works on Windows and Linux.
A simple example with your sentence above:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
The pprint yields:
[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
It also tokenized all'ippodromo into al and ippodromo quite nicely (which is hopefully correct) under the hood, before lemmatizing. Now we just need to apply the removal of stop words and punctuation, and it will be fine.
The documentation for installing the TreeTaggerWrapper library for Python
I know this issue was solved a few years ago, but I am facing the same problem with nltk tokenization and Python 3 with regard to parsing words like all'ippodromo or dall'Italia. So I want to share my experience and give a partial, although late, answer.
The first action/rule that an NLP pipeline must take into account is preparing the corpus. I discovered that by replacing the ' character with a proper apostrophe ’ (using an accurate regex replacement during text parsing, or just a preparatory replace-all in a basic text editor), the tokenization works correctly and I get the proper splitting with just nltk.tokenize.word_tokenize(text).
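A minimal sketch of that preprocessing step, using a sample sentence from the question above (the plain .replace call stands in for the more careful regex substitution mentioned):
import nltk

# Sample sentence from the question above.
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo."

# Swap the straight apostrophe for the typographic apostrophe (U+2019) before tokenizing;
# a targeted re.sub could be used instead for trickier corpora.
prepared = it_string.replace("'", "\u2019")

print(nltk.tokenize.word_tokenize(prepared))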

Find a value based on a matching word and store it

I am new to Python, and I'm just trying to get a feel for the language.
I have a file called lion.txt that has this text:
The lion (Panthera leo) is one of the big cats in the genus Panthera and a member of the family Felidae. The commonly used term African lion collectively denotes the several subspecies in Africa. With some males exceeding=250/12 kg (550 lb) in weight,[4].
What I want my program to do is search for the keyword exceeding and write only the value 250 to another file called searched.txt. At best, is it possible to store it as a variable and then print it to another text file?
This is what I have so far:
import os
import re

os.chdir("C:\Python 2016 Training\lionfolder")
f = open("lion.txt", "r")
w = open("searched.txt", "w")
k = []  # Figured a dictionary would be the best way to deal with this?
for line in f:
    if re.match('(.*)exceeding(.*)', line):
        w.write(k[1] = "line")
Is what I'm asking to do even possible with Python?
Thank you in advance
Regards,
Kevin.
Not bad for a first attempt. You're close to a working solution, but missing some critical parts. Try this:
import os
import re
f = open('lion.txt', 'r')
w = open('searched.txt', 'w')
for line in f:
    match = re.search('exceeding\=(\d+)', line)
    if match:
        w.write(match.group(1))
w.close()
f.close()
There are better ways of doing this, but I have tried to stay as close to your original code as possible, so you don't get lost.
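One of those better ways is to let a with block close the files for you and to compile the pattern once; a minimal sketch of the same search, using the same file names as above:
import re

# Compile once; the group captures the digits that follow "exceeding=".
pattern = re.compile(r'exceeding=(\d+)')

with open('lion.txt', 'r') as f, open('searched.txt', 'w') as w:
    for line in f:
        match = pattern.search(line)
        if match:
            w.write(match.group(1))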

Haskell Regex performance

I've been looking at the existing options for regexes in Haskell, and I wanted to understand where the performance gap comes from when comparing the various options with each other, and especially with a simple call to grep.
I have a relatively small trace file (~110 MB, compared to the usual several tens of GB in most of my use cases):
$ du radixtracefile
113120 radixtracefile
$ wc -l radixtracefile
1051565 radixtracefile
I first tried to find out how many matches of the (arbitrary) pattern .*504.*ll were in there using grep:
$ time grep -nE ".*504.*ll" radixtracefile | wc -l
309
real 0m0.211s
user 0m0.202s
sys 0m0.010s
I looked at Text.Regex.TDFA (version 1.2.1) with Data.ByteString:
import Control.Monad.Loops
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.Regex.TDFA
import qualified Data.ByteString as B
main = do
    f <- B.readFile "radixtracefile"
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (putStrLn . show . head) matches
Building and running:
$ ghc -O2 test-TDFA.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-TDFA.hs, test-TDFA.o )
Linking test-TDFA ...
$ time ./test-TDFA | wc -l
309
real 0m4.463s
user 0m4.431s
sys 0m0.036s
Then, I looked at Data.Text.ICU.Regex (version 0.7.0.1) with Unicode support:
import Control.Monad.Loops
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU.Regex
main = do
    re <- regex [] $ T.pack ".*504.*ll"
    f <- TIO.readFile "radixtracefile"
    setText re f
    whileM_ (findNext re) $ do
        a <- start re 0
        putStrLn $ "last match at :" ++ (show a)
Building and running:
$ ghc -O2 test-ICU.hs
[1 of 1] Compiling Main ( test-ICU.hs, test-ICU.o )
Linking test-ICU ...
$ time ./test-ICU | wc -l
309
real 1m36.407s
user 1m36.090s
sys 0m0.169s
I use GHC version 7.6.3. I haven't had the chance to test other Haskell regex options yet. I knew that I would not get the performance I get with grep, and I was more than happy with that, but roughly 20 times slower for TDFA with ByteString... that is very scary. And I can't really understand why, as I naively thought this was a wrapper over a native backend... Am I somehow not using the module correctly?
(And let's not mention the ICU + Text combo, which is going through the roof.)
Is there an option that I haven't tested yet that would make me happier?
EDIT:
Text.Regex.PCRE (version 0.94.4) with Data.ByteString:
import Control.Monad.Loops
import Data.Maybe
import Text.Regex.PCRE
import qualified Data.ByteString as B
main = do
    f <- B.readFile "radixtracefile"
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (putStrLn . show . head) matches
Building and running:
$ ghc -O2 test-PCRE.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-PCRE.hs, test-PCRE.o )
Linking test-PCRE ...
$ time ./test-PCRE | wc -l
309
real 0m1.442s
user 0m1.412s
sys 0m0.031s
Better, but still a factor of roughly 7...
So, after looking at other libraries for a bit, I ended up trying PCRE.Light (version 0.4.0.4):
import Control.Monad
import Text.Regex.PCRE.Light
import qualified Data.ByteString.Char8 as B
main = do
    f <- B.readFile "radixtracefile"
    let lines = B.split '\n' f
    let re = compile (B.pack ".*504.*ll") []
    forM_ lines $ \l -> maybe (return ()) print $ match re l []
Here is what I get out of that:
$ ghc -O2 test-PCRELight.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-PCRELight.hs, test-PCRELight.o )
Linking test-PCRELight ...
$ time ./test-PCRELight | wc -l
309
real 0m0.832s
user 0m0.803s
sys 0m0.027s
I think this is decent enough for my purposes. I might try to see what happens with the other libs when I manually do the line splitting like I did here, although I doubt it's going to make a big difference.

Scala Regex help on UCI data set

Hi guys, I'm trying to parse some data from http://kdd.ics.uci.edu/databases/20newsgroups/20_newsgroups.tar.gz using Scala regexes.
Here's the text that I'm trying to process:
val inputData = """xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51121 soc.motss:139944 rec.scouting:5318
newsgroups: alt.atheism,soc.motss,rec.scouting
path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!wupost!uunet!newsgate.watson.ibm.com!yktnews.watson.ibm.com!watson!watson.ibm.com!strom
from: strom#watson.ibm.com (rob strom)
subject: re: [soc.motss, et al.] "princeton axes matching funds for boy scouts"
sender: #watson.ibm.com
message-id: <1993apr05.180116.43346#watson.ibm.com>
date: mon, 05 apr 93 18:01:16 gmt
distribution: usa
references: <c47efs.3q47#austin.ibm.com> <1993mar22.033150.17345#cbnewsl.cb.att.com> <n4hy.93apr5120934#harder.ccr-p.ida.org>
organization: ibm research
lines: 15
in article <n4hy.93apr5120934#harder.ccr-p.ida.org>, n4hy#harder.ccr-p.ida.org (bob mcgwier) writes:
|> [1] however, i hate economic terrorism and political correctness
|> worse than i hate this policy.
|> [2] a more effective approach is to stop donating
|> to any organizating that directly or indirectly supports gay rights issues
|> until they end the boycott on funding of scouts.
can somebody reconcile the apparent contradiction between [1] and [2]?
--
rob strom, strom#watson.ibm.com, (914) 784-7641
ibm research, 30 saw mill river road, p.o. box 704, yorktown heights, ny 10598"""
Here's the output that I need:
in article <n4hy.93apr5120934#harder.ccr-p.ida.org>, n4hy#harder.ccr-p.ida.org (bob mcgwier) writes:
|> [1] however, i hate economic terrorism and political correctness
|> worse than i hate this policy.
|> [2] a more effective approach is to stop donating
|> to any organizating that directly or indirectly supports gay rights issues
|> until they end the boycott on funding of scouts.
can somebody reconcile the apparent contradiction between [1] and [2]?
Here's what I tried:
val docParser = """([\\s\\S]+\\lines: \\d*)([\\s\\S]*\\n\\n)([\\s\\S]*)""".r
val docParser(metadata, content, footer) = inputData
But I'm getting the following error:
scala.MatchError: [Ljava.lang.String;#62f8fff1 (of class [Ljava.lang.String;)
An online regex builder seems to work, though.
Any ideas? :)
I have never programmed in Scala before, but from what I can see at http://www.tutorialspoint.com/scala/scala_regular_expressions.htm, you have to double-escape things like digit classes. So \d would become \\d in Scala, and so on.