I am using the NLTK library in Python to break each word down into tagged elements (e.g. ('London', 'NNP')). However, I cannot figure out how to take this list and capitalise locations if they are lower case. This is important because 'london' is no longer tagged as an 'NNP', and some other locations even get tagged as verbs. If anyone knows how to do this efficiently, that would be amazing!
Here is my code:
# returns nature of question with appropriate response text
def chunk_target(self, text, extract_targets):
    custom_sent_tokenizer = PunktSentenceTokenizer(text)
    tokenized = custom_sent_tokenizer.tokenize(text)
    stack = []
    for chunk_grammer in extract_targets:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            new = []
            # This is where I'm trying to turn valid locations into NNP (capitalise)
            for w in tagged:
                print(w[0])
                for line in self.stations:
                    if w[0].title() in line.split() and len(w[0]) > 2 and w[0].title() not in new:
                        new.append(w[0].title())
                        w = w[0].title()
            print(new)
            print(tagged)
            chunkGram = chunk_grammer
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                stack.append(subtree)
    if stack != []:
        return stack[0]
    return None
What you're looking for is Named Entity Recognition (NER). NLTK does support a named entity function: ne_chunk, which can be used for this purpose. I'll give a demonstration:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()
locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
    # Extract named entity type and the chunk
    ne_type = named_entity.label()
    chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
    print(ne_type, chunk)
    if ne_type == "GPE":
        locations.append(chunk)
print(locations)
This outputs (with my comments added):
# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
In/IN
the/DT
wake/NN
of/IN
a/DT
string/NN
of/IN
abuses/NNS
by/IN
(GPE New/NNP York/NNP)
police/NN
officers/NNS
in/IN
the/DT
1990s/CD
,/,
(PERSON Loretta/NNP E./NNP Lynch/NNP)
,/,
the/DT
top/JJ
federal/JJ
prosecutor/NN
in/IN
(GPE Brooklyn/NNP)
,/,
spoke/VBD
forcefully/RB
about/IN
the/DT
pain/NN
of/IN
a/DT
broken/JJ
trust/NN
that/IN
African-Americans/NNP
felt/VBD
and/CC
said/VBD
the/DT
responsibility/NN
for/IN
repairing/VBG
generations/NNS
of/IN
miscommunication/NN
and/CC
mistrust/NN
fell/VBD
to/TO
law/NN
enforcement/NN
./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']
However, it should be noted that ne_chunk's performance seems to fall significantly if we remove all capitalisation from the sentence.
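A quick way to see this for yourself, reusing the same pipeline as above:
lower_tree = ne_chunk(pos_tag(word_tokenize(sentence.lower())))
# ne_chunk leans heavily on capitalisation cues, so expect far fewer
# (possibly no) named-entity subtrees in this tree.
lower_tree.pprint()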
We can do something similar with spaCy:
import spacy
import en_core_web_sm
from pprint import pprint
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()
doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])
Which outputs:
[('New York', 'GPE'),
('the 1990s', 'DATE'),
('Loretta E. Lynch', 'PERSON'),
('Brooklyn', 'GPE'),
('African-Americans', 'NORP')]
['New York', 'Brooklyn']
This output (for GPEs) is identical to NLTK's, but the reason I mention spaCy is that, unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, the output becomes:
[('new york', 'GPE'),
('the 1990s', 'DATE'),
('loretta e. lynch', 'PERSON'),
('brooklyn', 'GPE'),
('african-americans', 'NORP')]
['new york', 'brooklyn']
This allows you to title-case these words in an otherwise lower-case sentence.
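For example, a minimal sketch of that last step, assuming nlp is the en_core_web_sm pipeline loaded above (the str.replace here is a simplification and would also touch any other occurrence of the same text):
lower_sentence = sentence.lower()
doc = nlp(lower_sentence)
fixed = lower_sentence
for ent in doc.ents:
    if ent.label_ == "GPE":
        # title-case each detected location in the otherwise lower-case sentence
        fixed = fixed.replace(ent.text, ent.text.title())
print(fixed)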
Related
Hi there!
I am trying to output all the possible parts of speech (POS) of each word in a text. However, I need the output as a list of lists or a list of tuples for further use.
Any help would be much appreciated!
import nltk
from nltk.tokenize import word_tokenize
text = "I can answer those question ." # original text
tokenized_text = word_tokenize(text) # word tokenization
wsj = nltk.corpus.treebank.tagged_words()
cfd1 = nltk.ConditionalFreqDist(wsj) # find all possible POS tags of each word
i = 0
while i < len(tokenized_text):
    pos_only = list(cfd1[tokenized_text[i]])
    y = pos_only
    print(y)
    i += 1
my output is
['NNP', 'PRP']
['MD', 'NN']
['NN', 'VB']
['DT']
['NN', 'VBP', 'VB']
['.']
my expected output is
[['NNP', 'PRP'], ['MD', 'NN'], ['NN', 'VB'], ['DT'], ['NN', 'VBP', 'VB'], ['.']]
or
[('NNP', 'PRP'), ('MD', 'NN'), ('NN', 'VB'), ('DT'), ('NN', 'VBP', 'VB'), ('.')]
I think you will need to create an empty list and append to it during iteration. I assume print(y) outputs ['NNP', 'PRP'] and so on; you can convert each y to a tuple and append it to the list as you go. This piece of code should do it:
alist = []
i = 0
while i < len(tokenized_text):
    pos_only = list(cfd1[tokenized_text[i]])
    y = pos_only
    alist.append(tuple(y))
    i += 1
print(alist)
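As a side note, the same result can be built in one pass with a list comprehension; a minimal sketch, assuming tokenized_text and cfd1 are defined as in the question:
# one tuple of possible tags per token, in order
alist = [tuple(cfd1[word]) for word in tokenized_text]
print(alist)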
I am reading in a CSV file with the general schema of
,abv,ibu,id,name,style,brewery_id,ounces
14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
I am running into problems where fields do not exist, such as in row 0, which is missing an IBU value. I would like to be able to insert a value such as 0.0 for fields that require floats, and an empty string for fields that require strings.
My code is along the lines of
import csv
import numpy as np

def dataset(path, filter_field, filter_value):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        if filter_field:
            for row in filter(lambda row: row[filter_field] == filter_value, reader):
                yield row

def main(path):
    data = [(row["ibu"], float(row["ibu"])) for row in dataset(path, "style", "American Pale Lager")]
As of right now my code throws an error, since there are empty values in the "ibu" column for row 0.
How should one go about solving this problem?
You can do the following: add a default dictionary that supplies values for missing fields, and update it under certain conditions, such as when ibu is empty.
Below is your implementation changed to do what you need. If I were you, though, I would use pandas ...
import csv, copy

def dataset(path, filter_field, filter_value,
            default={'brewery_id': -1, 'style': 'unknown style', '': -1,
                     'name': 'unknown name', 'abv': 0.0, 'id': -1,
                     'ounces': -1, 'ibu': 0.0}):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if row is None:
                break
            if row[filter_field].strip() != filter_value:
                continue
            default_row = copy.copy(default)
            default_row.update(row)
            # you might want to add conditions
            if default_row["ibu"] == "":
                default_row["ibu"] = default["ibu"]
            yield default_row

data = [(row["ibu"], float(row["ibu"])) for row in dataset('test.csv', "style", "American Pale Lager")]
print data
>> [(0.0, 0.0)]
Why don't you use
import pandas as pd
df = pd.read_csv(data_file)
The following is the result:
In [13]: df
Out[13]:
Unnamed: 0 abv ibu id name style \
0 14 0.061 60.0 1979 Bitter Bitch American Pale Ale (APA)
1 0 0.050 NaN 1436 Pub Beer American Pale Lager
brewery_id ounces
0 177 12.0
1 408 12.0
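Missing ibu values then come in as NaN, which you can replace before filtering; a minimal sketch, assuming data_file points at the same CSV as above (note the str.strip() to cope with the stray spaces in the sample data):
import pandas as pd

df = pd.read_csv(data_file)
df['ibu'] = df['ibu'].fillna(0.0)    # float default for missing IBUs
df['name'] = df['name'].fillna('')   # empty string for missing names
lagers = df[df['style'].str.strip() == 'American Pale Lager']
data = list(zip(lagers['ibu'], lagers['ibu'].astype(float)))
print(data)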
Simulating your file with a text string:
In [48]: txt=b""" ,abv,ibu,id,name,style,brewery_id,ounces
...: 14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
...: 0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
...: """
I can load it with numpy genfromtxt.
In [49]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, skip_header=1, filling_values=0)
In [50]: data
Out[50]:
array([ (14, 0.061, 60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177, 12.),
( 0, 0.05 , 0., 1436, b' Pub Beer', b' American Pale Lager', 408, 12.)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'), ('f4', 'S12'), ('f5', 'S23'), ('f6', '<i4'), ('f7', '<f8')])
I had to skip the header line because it is incomplete (a blank for the 1st field). The result is a structured array - a mix of ints, floats and strings (bytestrings in Py3).
After correcting the header line, and using names=True, I get
array([ (14, 0.061, 60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177, 12.),
( 0, 0.05 , 0., 1436, b' Pub Beer', b' American Pale Lager', 408, 12.)],
dtype=[('f0', '<i4'), ('abv', '<f8'), ('ibu', '<f8'), ('id', '<i4'), ('name', 'S12'), ('style', 'S23'), ('brewery_id', '<i4'), ('ounces', '<f8')])
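For reference, that second call could look roughly like this; here the blank first header field is given a hypothetical placeholder name ('idx') so that names=True can label every column (the author's own run evidently kept the auto-generated f0 name for that column):
txt_fixed = b"""idx,abv,ibu,id,name,style,brewery_id,ounces
14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
"""
data = np.genfromtxt(txt_fixed.splitlines(), delimiter=',', names=True,
                     dtype=None, filling_values=0)
print(data['ibu'])  # the missing ibu has been filled with 0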
genfromtxt is the most powerful CSV reader in numpy. See its docs for more parameters. The pandas reader is faster and more flexible, but of course produces a data frame, not an array.
I have the following code:
from Tkinter import *
import itertools

l1 = [1, 'One', [[1, '1', '2'], [2, '3', '4'], [3, '5', '6']]]
l2 = [2, 'Two', [[1, 'one', 'two'], [2, 'three', 'four'], [3, 'five', 'six']]]

def session(evt, contents):
    def setup_cards():
        cards = [stack[2] for stack in contents]
        setup = [iter(stack) for stack in cards]
        return cards, setup

    def end():
        window.destroy()

    def start():
        print setup
        print cards
        pair = next(setup[0])

        def flip():
            side2cont.set(pair[2])
            flipbutton.configure(command=start)

        for stack in setup:
            try:
                for card in cards:
                    try:
                        side1cont.set(pair[1])
                        flipbutton.configure(command=flip)
                    except StopIteration:
                        continue
            except StopIteration:
                pair = next(setup[1])

    window = Toplevel()
    window.grab_set()
    window.title("Session")
    card_frame = Frame(window)
    card_frame.grid(row=0, column=0, sticky=W, padx=2, pady=2)
    button_frame = Frame(window)
    button_frame.grid(row=1, column=0, pady=(5,0), padx=2)
    side1_frame = LabelFrame(card_frame, text="Side 1")
    side1_frame.grid(row=0, column=0)
    side1cont = StringVar()
    side2cont = StringVar()
    side1 = Label(side1_frame, textvariable=side1cont)
    side1.grid(row=0, column=0, sticky=W)
    side2_frame = LabelFrame(card_frame, text="Side 2")
    side2_frame.grid(row=1, column=0)
    side2 = Label(side2_frame, textvariable=side2cont)
    side2.grid(row=0, column=0, sticky=W)
    flipbutton = Button(button_frame, text="Flip", command=start)
    flipbutton.grid(row=0, column=2)
    finishbutton = Button(button_frame, text="End", command=end)
    finishbutton.grid(row=0, column=0, sticky=E)
    cards = setup_cards()[0]
    setup = setup_cards()[1]

w = Tk()
wbutton = Button(text='toplevel')
wbutton.bind('<Button-1>', lambda evt, args=(l1, l2): session(evt, args))
wbutton.pack()
w.mainloop()
This is a piece of my project, cut down to the basics so it is easy to understand. In my project, the session function accepts files; these are emulated here as the lists l1 and l2.
The point where I am struggling is when I hit the StopIteration exception. I would like my script to do the following:
1. When the iteration reaches its end, switch to the next iterator (the next item in the setup list, in this case the l2 iterator).
2. If no other iterators are present in setup, reset the iterator ("start over from the beginning").
The code above is the best I was able to come up with, which is why I'm turning to you folks. Thank you (I'm a newbie, so I'm still struggling with the basics of Python/programming in general).
StopIteration is caught by for and not propagated further; you may want to use for…else.
But your method of iteration is rather convoluted anyway. Why not just use regular for loops?
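For the two iteration requirements specifically, one way to avoid juggling StopIteration altogether is to chain the card stacks and wrap them in itertools.cycle; a minimal sketch, decoupled from the Tkinter code (card_pairs is a made-up helper name, and l1/l2 are the lists from the question):
from itertools import chain, cycle

def card_pairs(contents):
    cards = [stack[2] for stack in contents]   # pull out each card list
    return cycle(chain.from_iterable(cards))   # stack after stack, then start over

pairs = card_pairs([l1, l2])
print next(pairs)   # [1, '1', '2']
print next(pairs)   # [2, '3', '4']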
I have been playing with the NLTK toolkit. I come across this problem a lot, and I have searched for a solution online but haven't found a satisfying answer, so I am putting my query here.
Many times the NER doesn't tag consecutive NNPs as one NE. I think editing the NER to use a RegexpTagger could also improve it.
Example:
Input:
Barack Obama is a great person.
Output:
Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])
whereas the input:
Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.
Output:
Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('', ''), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])
Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted.
So I think that if nltk.ne_chunk is used first, and two consecutive subtrees are NNPs, there is a high chance that both refer to one entity.
Any suggestions would be really appreciated; I am looking for flaws in my approach.
Thanks.
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
    if continuous_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person."
print get_continuous_chunks(txt)
[out]:
['Barack Obama']
But do note that if the continuous chunks are not supposed to be a single NE, then you would be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it would happen. If they are not continuous, the script above works fine:
>>> txt = "Barack Obama is the husband of Michelle Obama."
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']
There is a bug in @alvas's answer: a fencepost error. Make sure to run that elif check outside of the loop as well, so that you don't leave off an NE that occurs at the end of the sentence. So:
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []
        else:
            continue
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
            current_chunk = []
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print get_continuous_chunks(txt)
@alvas, great answer. It was really helpful. I have tried to capture your solution in a more functional way, though it still needs improvement.
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def conditions(tree_node):
    return tree_node.height() == 2

def continuous_entities(input_text, file_handle):
    # Note: Currently, the chunker categorizes only 2 'NNP' together.
    docs = input_text.split('\n')
    for input_text in docs:
        chunked_data = ne_chunk(pos_tag(word_tokenize(input_text)))
        child_data = [subtree for subtree in chunked_data.subtrees(filter=conditions)]
        named_entities = []
        for child in child_data:
            if type(child) == Tree:
                named_entities.append(" ".join([token for token, pos in child.leaves()]))
        # Dump all entities to file for now, we will see how to go about that
        if file_handle is not None:
            file_handle.write('\n'.join(named_entities) + '\n')
    return named_entities
Using the conditions function, one can add more conditions to filter on.
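For instance, a stricter filter could also check the entity label; person_conditions below is a hypothetical variant for illustration:
def person_conditions(tree_node):
    # keep only leaf-level chunks that ne_chunk labelled as PERSON
    return tree_node.height() == 2 and tree_node.label() == 'PERSON'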
I am trying to combine the contents of two lists, in order to later perform processing on the entire data set. I initially looked at the built-in insert function, but it inserts the list as a single element rather than the contents of the list.
I can slice and append the lists, but is there a cleaner / more Pythonic way of doing what I want than this:
array = ['the', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
addition = ['quick', 'brown']
array = array[:1] + addition + array[1:]
You can do the following using the slice syntax on the left hand side of an assignment:
>>> array = ['the', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
>>> array[1:1] = ['quick', 'brown']
>>> array
['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
That's about as Pythonic as it gets!
The extend method of the list object does this, but it appends at the end of the list it is called on:
addition.extend(array)
insert(i, j), where i is the index and j is what you want to insert, does not add j as a nested list; it adds it as a single item:
array = ['the', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
array.insert(1,'brown')
The new array would be:
array = ['the', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Leveraging the splat operator (iterable unpacking) for lists, you can do it like this:
array = ['the', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
addition = ['quick', 'brown']
# like this
array2 = ['the', *addition, 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
# or like this
array = [ *array[:1], *addition, *array[1:]]
print(array)
print(array2)
to get
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
The operator was introduced with PEP 448: Additional Unpacking Generalizations.