Named Entity Recognition with Regular Expression: NLTK - regex

I have been playing with the NLTK toolkit. I come across this problem a lot and have searched online for a solution, but nowhere did I find a satisfying answer, so I am posting my query here.
Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to use a RegexpTagger could also improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
whereas for the input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('', ''), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here Vice/NNP, President/NNP and (Dick/NNP, Cheney/NNP) are consecutive NNPs, yet only (Dick, Cheney) is extracted as a single NE.
So I think that if nltk.ne_chunk is used first, and two consecutive trees are NNP, there is a high chance that both refer to one entity.
Any suggestions would be really appreciated. I am also looking for flaws in my approach.
Thanks.

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    if continuous_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print get_continuous_chunks(txt)
[out]:
['Barack Obama']
But do note that if two adjacent chunks are not supposed to be a single NE, then you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If the NEs are not adjacent, however, the script above works fine:
>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
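As a side note on the regex idea from the original question: if you want to experiment with grouping consecutive NNPs yourself, rather than relying on ne_chunk's labels, a small nltk.RegexpParser grammar can do it. This is only a rough sketch of that idea; the grammar and the NE label are my own choices, not something built into NLTK's NER:
from nltk import RegexpParser, pos_tag, word_tokenize

# Toy grammar: treat any run of consecutive proper nouns as one chunk.
grammar = "NE: {<NNP>+}"
chunker = RegexpParser(grammar)

tagged = pos_tag(word_tokenize("Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he was honored."))
chunker.parse(tagged).pprint()
Note that this happily glues together NNPs that belong to different entities (e.g. a title followed by a name), so treat it as a recall-boosting heuristic rather than a replacement for proper NER.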

There is a bug in @alvas's answer: a fencepost error. Make sure to run that elif check outside of the loop as well, so that you don't drop an NE that occurs at the end of the sentence. So:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue

    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
            current_chunk = []

    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print get_continuous_chunks(txt)
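With that trailing check in place, an entity at the very end of the sentence is no longer dropped, so (assuming ne_chunk recognises both names, as it does in the example further up) the call above should now return both entities:
>>> get_continuous_chunks("Barack Obama is a great person and so is Michelle Obama.")
['Barack Obama', 'Michelle Obama']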

@alvas, great answer; it was really helpful. I have tried to capture your solution in a more functional way, though it still needs improvement.
def conditions(tree_node):
    return tree_node.height() == 2

def continuous_entities(input_text, file_handle=None):
    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    docs = input_text.split('\n')
    for input_text in docs:
        chunked_data = ne_chunk(pos_tag(word_tokenize(input_text)))
        child_data = [subtree for subtree in chunked_data.subtrees(filter=conditions)]
        named_entities = []
        for child in child_data:
            if type(child) == Tree:
                named_entities.append(" ".join([token for token, pos in child.leaves()]))
        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
    return named_entities
Using the conditions function, one can add many conditions to the filter.
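For example, a slightly stricter filter (purely illustrative; which labels to keep is up to you) could also check the entity label:
def conditions(tree_node):
    # Keep only leaf-level chunks whose label is an NE type we care about.
    return tree_node.height() == 2 and tree_node.label() in ("PERSON", "GPE", "ORGANIZATION")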

Related

How to Capitalize Locations in a List Python

I am using the NLTK library in Python to break down each word into tagged elements (e.g. ('London', 'NNP')). However, I cannot figure out how to take this list and capitalise locations if they are lower case. This is important because london is no longer an 'NNP', and some other locations even become verbs. If anyone knows how to do this efficiently, that would be amazing!
Here is my code:
# returns nature of question with appropriate response text
def chunk_target(self, text, extract_targets):
    custom_sent_tokenizer = PunktSentenceTokenizer(text)
    tokenized = custom_sent_tokenizer.tokenize(text)
    stack = []
    for chunk_grammer in extract_targets:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            new = []
            # This is where I'm trying to turn valid locations into NNP (capitalise)
            for w in tagged:
                print(w[0])
                for line in self.stations:
                    if w[0].title() in line.split() and len(w[0]) > 2 and w[0].title() not in new:
                        new.append(w[0].title())
                        w = w[0].title()
            print(new)
            print(tagged)
            chunkGram = chunk_grammer
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                stack.append(subtree)
    if stack != []:
        return stack[0]
    return None
What you're looking for is Named Entity Recognition (NER). NLTK supports this through the ne_chunk function, which can be used for this purpose. I'll give a demonstration:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()
locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract named entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)
print(locations)
This outputs (with my comments added):
# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
  In/IN
  the/DT
  wake/NN
  of/IN
  a/DT
  string/NN
  of/IN
  abuses/NNS
  by/IN
  (GPE New/NNP York/NNP)
  police/NN
  officers/NNS
  in/IN
  the/DT
  1990s/CD
  ,/,
  (PERSON Loretta/NNP E./NNP Lynch/NNP)
  ,/,
  the/DT
  top/JJ
  federal/JJ
  prosecutor/NN
  in/IN
  (GPE Brooklyn/NNP)
  ,/,
  spoke/VBD
  forcefully/RB
  about/IN
  the/DT
  pain/NN
  of/IN
  a/DT
  broken/JJ
  trust/NN
  that/IN
  African-Americans/NNP
  felt/VBD
  and/CC
  said/VBD
  the/DT
  responsibility/NN
  for/IN
  repairing/VBG
  generations/NNS
  of/IN
  miscommunication/NN
  and/CC
  mistrust/NN
  fell/VBD
  to/TO
  law/NN
  enforcement/NN
  ./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']
However, it should be noted that the performance of this ne_chunk seems to fall significantly if we remove all capitalisation from the sentence.
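If you want to see that effect yourself, a rough sanity check (reusing the sentence variable and the imports from the snippet above) is to re-run the same pipeline on a lower-cased copy and count how many entity subtrees survive:
# Rough check of the claim above: count the named-entity subtrees left after lower-casing.
tree_lower = ne_chunk(pos_tag(word_tokenize(sentence.lower())))
entities_lower = [" ".join(tok for tok, pos in st.leaves())
                  for st in tree_lower.subtrees(lambda t: t.label() != 'S')]
print(len(entities_lower), entities_lower)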
We can perform similar stuff with spaCy:
import spacy
import en_core_web_sm
from pprint import pprint
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()
doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])
Which outputs:
[('New York', 'GPE'),
('the 1990s', 'DATE'),
('Loretta E. Lynch', 'PERSON'),
('Brooklyn', 'GPE'),
('African-Americans', 'NORP')]
['New York', 'Brooklyn']
This output (for GPE's) is identical to NLTK's, but the reason I mention spaCy is because unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, then the output becomes:
[('new york', 'GPE'),
('the 1990s', 'DATE'),
('loretta e. lynch', 'PERSON'),
('brooklyn', 'GPE'),
('african-americans', 'NORP')]
['new york', 'brooklyn']
This allows you to title-case these words in an otherwise lower-case sentence.
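As a rough sketch of that last step (my own helper, not part of spaCy's API): walk the tokens of the lower-cased doc and title-case only those that fall inside a GPE span, leaving everything else untouched.
import en_core_web_sm

nlp = en_core_web_sm.load()

def titlecase_locations(sentence):
    # Hypothetical helper: title-case only the tokens inside GPE entities.
    doc = nlp(sentence.lower())
    gpe_token_ids = {tok.i for ent in doc.ents if ent.label_ == "GPE" for tok in ent}
    return "".join(
        (tok.text.title() if tok.i in gpe_token_ids else tok.text) + tok.whitespace_
        for tok in doc
    )

print(titlecase_locations("i am travelling from london to new york."))
# If the model tags both spans as GPE, this prints:
# "i am travelling from London to New York."
From there you could substitute the title-cased tokens back before POS tagging, so that locations keep their NNP tags.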

store values from loop in a list of lists or a list of tuples

Hi there!
I am trying to output all the possible part-of-speech (POS) tags of each word in the text. However, I need to print the output as a list of lists or a list of tuples for further use.
Can anyone help? Many thanks!
import nltk
from nltk.tokenize import word_tokenize

text = "I can answer those question ."  # original text
tokenized_text = word_tokenize(text)  # word tokenization
wsj = nltk.corpus.treebank.tagged_words()
cfd1 = nltk.ConditionalFreqDist(wsj)  # find all possible pos of each word
i = 0
while i < len(tokenized_text):
    pos_only = list(cfd1[tokenized_text[i]])
    y = pos_only
    print(y)
    i += 1
my output is
['NNP', 'PRP']
['MD', 'NN']
['NN', 'VB']
['DT']
['NN', 'VBP', 'VB']
['.']
my expected output is
[['NNP', 'PRP'], ['MD', 'NN'], ['NN', 'VB'], ['DT'], ['NN', 'VBP', 'VB'], ['.']]
or
[('NNP', 'PRP'), ('MD', 'NN'), ('NN', 'VB'), ('DT'), ('NN', 'VBP', 'VB'), ('.')]
I think you will need to create an empty list and append elements during iteration. I assume print(y) outputs ['NNP', 'PRP'], etc. You should convert y to a tuple and append it to the list during iteration. This piece of code should do it:
alist = []
i = 0
while i < len(tokenized_text):
    pos_only = list(cfd1[tokenized_text[i]])
    y = pos_only
    alist.append(tuple(y))
    i += 1
print(alist)
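For what it's worth, the same result can be built without the manual counter by using a list comprehension over the tokens (same cfd1 and tokenized_text as above):
# Same output as the while loop, as a list of tuples.
alist = [tuple(cfd1[token]) for token in tokenized_text]
print(alist)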

How to improve my feature selection for a NB classifier?

I have read that improving feature selection will reduce the training time of my classifier and also improve accuracy, but I am not sure how I can reduce the number of features. Should I count them and then select the first 3000, for example?
This is my code:
def save_object(obj, filename):
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)
    print "saved"
    ujson.dumps({"output": "obj"})

with open('neg5000.csv', 'rb') as f:
    reader = csv.reader(f)
    neg_tweets = list(reader)
print "list ready"

with open('pos5000.csv', 'rb') as f:
    reader = csv.reader(f)
    pos_tweets = list(reader)
print "list ready"

tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = list(wordlist.keys())[:3000]
    #word_features = wordlist.keys()
    return word_features

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

#def extract_features(words):
#    return dict([(word, True) for word in words])

word_features = get_word_features(get_words_in_tweets(tweets))
training_set = nltk.classify.apply_features(extract_features, tweets)
save_object(word_features, 'wordf.save')
print 'features done'
print datetime.datetime.now()
classifier = nltk.NaiveBayesClassifier.train(training_set)
print 'training done'
print datetime.datetime.now()
save_object(classifier, 'classifier.save')
tweet = 'I love this car'
print classifier.classify(extract_features(tweet.split()))
There are a number of ways to approach feature selection for a supervised classification problem (which is what Naive Bayes solves). I suggest heading over to the scikit-learn manual and just trying everything listed there, since the choice of a particular method depends on the data you have.
The easiest way to do this is to switch to the scikit-learn implementation of Naive Bayes and then use a Pipeline to chain the feature selection and classifier training. See this tutorial for code examples.
Here's a version of your code using scikit-learn with SelectPercentile feature selection:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def read_input(path):
    with open(path) as handle:
        lines = (line.rsplit(",", 1) for line in handle)
        return [text for text, label in lines]

# Assuming each line in ``neg5000.csv`` and ``pos5000.csv`` is a
# UTF-8-encoded tweet.
neg_tweets = read_input("neg5000.csv")
pos_tweets = read_input("pos5000.csv")

X = np.append(neg_tweets, pos_tweets)
y = np.append(np.full(len(neg_tweets), -1, dtype=int),
              np.full(len(pos_tweets), 1, dtype=int))

p = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("selector", SelectPercentile(percentile=20)),
    ("nb", MultinomialNB())
])

p.fit(X, y)
print(p.predict(["I love this car"]))
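If you want a quick sense of whether the feature selection step helps, one option (shown here only as a sketch; the numbers will depend entirely on your data) is to score the whole pipeline with cross-validation:
# Optional sanity check: 5-fold cross-validated accuracy of the pipeline.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(p, X, y, cv=5)
print(scores.mean())
You can then vary the percentile parameter of SelectPercentile (or swap in SelectKBest) and compare the scores.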

How to print out tags in python

If I have a string such as this:
text = "They refuse to permit us."
txt = nltk.word_tokenize(text)
With this, if I print the POS tags with nltk.pos_tag(txt), I get:
[('They','PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
How can I print out only this:
['PRP', 'VBP', 'TO', 'VB', 'PRP']
You've got a list of tuples; you should iterate through it to get only the second element of each tuple.
>>> tagged = nltk.pos_tag(txt)
>>> tags = [ e[1] for e in tagged]
>>> tags
['PRP', 'VBP', 'TO', 'VB', 'PRP']
Take a look at Unpacking a list / tuple of pairs into two lists / tuples
>>> from nltk import pos_tag, word_tokenize
>>> text = "They refuse to permit us."
>>> tagged_text = pos_tag(word_tokenize(text))
>>> tokens, pos = zip(*tagged_text)
>>> pos
('PRP', 'VBP', 'TO', 'VB', 'PRP', '.')
Possibly at some point you will find the POS tagger is slow and you will need to do this (see Slow performance of POS tagging. Can I do some kind of pre-warming?):
>>> from nltk import pos_tag, word_tokenize
>>> from nltk.tag import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> text = "They refuse to permit us."
>>> tagged_text = tagger.tag(word_tokenize(text))
>>> tokens, pos = zip(*tagged_text)
>>> pos
('PRP', 'VBP', 'TO', 'VB', 'PRP', '.')
You can iterate like this:
print [x[1] for x in nltk.pos_tag(txt)]

Tkinter and displaying iterating list

I have the following code:
from Tkinter import *
import itertools

l1 = [1, 'One', [[1, '1', '2'], [2, '3', '4'], [3, '5', '6']]]
l2 = [2, 'Two', [[1, 'one', 'two'], [2, 'three', 'four'], [3, 'five', 'six']]]

def session(evt, contents):
    def setup_cards():
        cards = [stack[2] for stack in contents]
        setup = [iter(stack) for stack in cards]
        return cards, setup

    def end():
        window.destroy()

    def start():
        print setup
        print cards
        pair = next(setup[0])

        def flip():
            side2cont.set(pair[2])
            flipbutton.configure(command=start)

        for stack in setup:
            try:
                for card in cards:
                    try:
                        side1cont.set(pair[1])
                        flipbutton.configure(command=flip)
                    except StopIteration:
                        continue
            except StopIteration:
                pair = next(setup[1])

    window = Toplevel()
    window.grab_set()
    window.title("Session")
    card_frame = Frame(window)
    card_frame.grid(row=0, column=0, sticky=W, padx=2, pady=2)
    button_frame = Frame(window)
    button_frame.grid(row=1, column=0, pady=(5,0), padx=2)
    side1_frame = LabelFrame(card_frame, text="Side 1")
    side1_frame.grid(row=0, column=0)
    side1cont = StringVar()
    side2cont = StringVar()
    side1 = Label(side1_frame, textvariable=side1cont)
    side1.grid(row=0, column=0, sticky=W)
    side2_frame = LabelFrame(card_frame, text="Side 2")
    side2_frame.grid(row=1, column=0)
    side2 = Label(side2_frame, textvariable=side2cont)
    side2.grid(row=0, column=0, sticky=W)
    flipbutton = Button(button_frame, text="Flip", command=start)
    flipbutton.grid(row=0, column=2)
    finishbutton = Button(button_frame, text="End", command=end)
    finishbutton.grid(row=0, column=0, sticky=E)
    cards = setup_cards()[0]
    setup = setup_cards()[1]

w = Tk()
wbutton = Button(text='toplevel')
wbutton.bind('<Button-1>', lambda evt, args=(l1, l2): session(evt, args))
wbutton.pack()
w.mainloop()
This is a piece of my project, reduced to the basics so it is easy to understand. In my project, the session function accepts files; these are emulated here as the lists l1 and l2.
The point where I am struggling is when I hit the StopIteration exception. I would like my script to do the following:
1. When an iterator reaches its end, switch to the next iterator (the next item in the setup list, in this case the l2 iterator).
2. If no other iterators are left in setup, reset the iterator ("start over from the beginning").
The code above is the best I was able to come up with, which is why I am turning to you folks. Thank you (I am a newbie, so I am still struggling with the basics of Python and programming in general).
StopIteration is caught by the for statement and not propagated further. You may want to use for…else.
But your methods of iteration are weird; why not just use regular for loops?
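To make the "switch to the next iterator, then start over" behaviour concrete outside of the Tkinter code, here is a minimal sketch using itertools: chain walks the stacks in order, and cycle restarts from the beginning once everything is exhausted. Wiring this into your session function is left to you.
import itertools

stacks = [['1', '2', '3'], ['one', 'two', 'three']]

# chain.from_iterable walks the first stack, then the next;
# cycle restarts from the very beginning once all stacks are exhausted.
cards = itertools.cycle(itertools.chain.from_iterable(stacks))

for _ in range(8):
    print(next(cards))   # 1 2 3 one two three 1 2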