I have been trying to find the frequency distribution of nouns in a given sentence. If I do this:
text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
token_text= nltk.word_tokenize(text)
tagged_sent = nltk.pos_tag(token_text)
nouns= []
for word,pos in tagged_sent:
if pos in ['NN',"NNP","NNS"]:
nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns
It considers "ball" and "ball." as separate words. So I went ahead and tokenized the sentence before tokenizing the words:
text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
sentences = nltk.sent_tokenize(text)
words = [nltk.word_tokenize(sent)for sent in sentences]
tagged_sent = [nltk.pos_tag(sent)for sent in words]
nouns= []
for word,pos in tagged_sent:
if pos in ['NN',"NNP","NNS"]:
nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns
It gives the following error:
Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\Trial.py", line 19, in <module>
    for word,pos in tagged_sent:
ValueError: too many values to unpack
What am I doing wrong? Please help.
You were so close!
In this case, you changed your tagged_sent from a list of tuples to a list of lists of tuples because of your list comprehension tagged_sent = [nltk.pos_tag(sent) for sent in words].
Here are some things you can do to discover what type of objects you have:
>>> type(tagged_sent), len(tagged_sent)
(<type 'list'>, 2)
This shows you that you have a list; in this case of 2 sentences. You can further inspect one of those sentences like this:
>>> type(tagged_sent[0]), len(tagged_sent[0])
(<type 'list'>, 9)
You can see that the first sentence is another list, containing 9 items. Well, what does one of those items look like? Let's look at the first item of the first list:
>>> tagged_sent[0][0]
('this', 'DT')
If you're curious to see the entire object, which I frequently am, you can ask the pprint (pretty-print) module to make it nicer to look at, like this:
>>> from pprint import pprint
>>> pprint(tagged_sent)
[[('this', 'DT'),
  ('ball', 'NN'),
  ('is', 'VBZ'),
  ('blue', 'JJ'),
  (',', ','),
  ('small', 'JJ'),
  ('and', 'CC'),
  ('extraordinary', 'JJ'),
  ('.', '.')],
 [('like', 'IN'), ('no', 'DT'), ('other', 'JJ'), ('ball', 'NN'), ('.', '.')]]
So, the long answer is that your code needs to iterate over the new second layer of lists, like this:
nouns = []
for sentence in tagged_sent:
    for word, pos in sentence:
        if pos in ['NN', 'NNP', 'NNS']:
            nouns.append(word)
Of course, this just returns a non-unique list of items, which looks like this:
>>> nouns
['ball', 'ball']
You can unique-ify this list in many different ways, but you can do it quickly by using the set() data structure, like so:
unique_nouns = list(set(nouns))
>>> print unique_nouns
['ball']
For an examination of other ways you can unique-ify a list of items, see the slightly older but extremely useful: http://www.peterbe.com/plog/uniqifiers-benchmark
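Since the original goal was a frequency distribution rather than a unique list, here is a minimal sketch (just the corrected pipeline from above, fed back into nltk.FreqDist) that restores that goal:

import nltk

text = "This ball is blue, small and extraordinary. Like no other ball."
tagged_sents = [nltk.pos_tag(nltk.word_tokenize(sent))
                for sent in nltk.sent_tokenize(text.lower())]

# flatten the sentences and keep only the noun tokens
nouns = [word for sent in tagged_sents
              for word, pos in sent
              if pos in ('NN', 'NNP', 'NNS')]

freq_nouns = nltk.FreqDist(nouns)
print(freq_nouns.most_common())   # [('ball', 2)]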
I am training word2vec on my own text corpus using Mikolov's implementation from here. Not all unique words from the corpus get a vector, even though I have set the min-count to 1. Are there any parameters I may have missed that might be the reason not all unique words get a vector? What else might be the reason?
To test word2vec's behavior I have written the following script, providing a text file with 20058 sentences and 278896 words (all words and punctuation are space separated and there is one sentence per line).
import subprocess

def get_w2v_vocab(path_embs):
    vocab = set()
    with open(path_embs, 'r', encoding='utf8') as f:
        next(f)
        for line in f:
            word = line.split(' ')[0]
            vocab.add(word)
    return vocab - {'</s>'}

def train(path_corpus, path_embs):
    subprocess.call(["./word2vec", "-threads", "6", "-train", path_corpus,
                     "-output", path_embs, "-min-count", "1"])

def get_unique_words_in_corpus(path_corpus):
    vocab = []
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            vocab.extend(line.strip('\n').split(' '))
    return set(vocab)

def check_equality(expected, actual):
    if not expected == actual:
        diff = len(expected - actual)
        raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
    print('Expected vocab and actual vocab are equal.')

def main():
    path_corpus = 'test_corpus2.txt'
    path_embs = 'embeddings.vec'
    vocab_expected = get_unique_words_in_corpus(path_corpus)
    train(path_corpus, path_embs)
    vocab_actual = get_w2v_vocab(path_embs)
    check_equality(vocab_expected, vocab_actual)

if __name__ == '__main__':
    main()
This script gives me the following output:
Starting training using file test_corpus2.txt
Vocab size: 33651
Words in train file: 298954
Alpha: 0.000048  Progress: 99.97%  Words/thread/sec: 388.16k
Traceback (most recent call last):
  File "test_w2v_behaviour.py", line 44, in <module>
    main()
  File "test_w2v_behaviour.py", line 40, in main
    check_equality(vocab_expected, vocab_actual)
  File "test_w2v_behaviour.py", line 29, in check_equality
    raise Exception('Not equal! Vocab expected: {}, Vocab actual: {}, Diff: {}'.format(len(expected), len(actual), diff))
Exception: Not equal! Vocab expected: 42116, Vocab actual: 33650, Diff: 17316
As long as you're using Python, you might want to use the Word2Vec implementation in the gensim package. It does everything the original Mikolov/Google word2vec.c does, and more, and is usually performance-competitive.
In particular, it won't have any issues with UTF-8 encoding, while I'm not sure the Mikolov/Google word2vec.c handles UTF-8 correctly; that may be a source of your discrepancy.
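For example, here is a minimal sketch of that route (assuming a gensim 4.x install and the same one-sentence-per-line, space-separated corpus file, test_corpus2.txt, from your script):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams one pre-tokenised (space-separated) sentence per line
sentences = LineSentence('test_corpus2.txt')

# min_count=1 keeps every token; workers sets the number of training threads
model = Word2Vec(sentences, min_count=1, workers=6)

print(len(model.wv))   # vocabulary size, to compare against your own unique-word count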
If you need to get to the bottom of your discrepancy, I would suggest:
have your get_unique_words_in_corpus() also tally/report the total number of non-unique words its tokenization creates (see the sketch after this list). If that's not the same as the 298954 reported by word2vec.c, then the two processes are clearly not working from the same baseline understanding of what 'words' are in the source file.
find some words, or at least one representative word, that your token-count expects to be in the final model, and isn't. Review those for any common characteristic – including in context in the file. That will probably reveal why the two tallies differ.
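For the first check, here is a minimal sketch (mirroring the tokenization your get_unique_words_in_corpus() already uses) that reports both the total and the unique token counts:

def corpus_token_counts(path_corpus):
    """Count total and unique space-separated tokens, line by line."""
    total = 0
    vocab = set()
    with open(path_corpus, 'r', encoding='utf8') as f:
        for line in f:
            tokens = line.strip('\n').split(' ')
            total += len(tokens)
            vocab.update(tokens)
    return total, vocab

total, vocab = corpus_token_counts('test_corpus2.txt')
# compare `total` with the "Words in train file" number printed by word2vec.c
print('total tokens: {}, unique tokens: {}'.format(total, len(vocab)))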
Again, I suspect something UTF-8 related, or perhaps related to other implementation limits in word2vec.c (such as a maximum word length) that are not mirrored in your Python-based word tallies.
You could use FastText instead of Word2Vec. FastText is able to embed out-of-vocabulary words by looking at subword information (character n-grams). Gensim also has a FastText implementation, which is very easy to use:
from gensim.models import FastText as ft

model = ft(sentences=training_data)

word = 'blablabla'            # can be out of vocabulary
embedded_word = model[word]   # fetches the word embedding
See https://stackoverflow.com/a/54709303/3275464
I have this column of numbers in a txt file that I want to append to a list:
18.0
13.0
10.0
12.0
8.0
My code for placing all these numbers into a list is:
last_number_lis = []
for numbers_to_put_in in (path/to/txt):
    last_number_lis.append(float(last_number))
    print last_number_lis
I want the list to look like
[18.0,13.0,10.0,12.0,8.0]
but instead, when running the code, it shows
[18.0]
[13.0]
[10.0]
[12.0]
[8.0]
Is there any way that all the numbers can be in one list? Later on, I would like to add all the numbers up. Thanks for your help!!
You can append to a list like this:
>>> list=[]
>>> list.append(18.0)
>>> list.append(13.0)
>>> list.append(10.0)
>>> list
[18.0, 13.0, 10.0]
but it depends on where your numbers are coming from...
For example, with input from the terminal:
>>> list=[]
>>> t=input("type a number to append the list : ")
type a number to append the list : 12.45
>>> list.append(float(t))
>>> t=input("type a number to append the list : ")
type a number to append the list : 15.098
>>> list.append(float(t))
>>> list
[12.45, 15.098]
Or reading from a file:
>>> list=[]
>>> with open('test.txt', 'r') as infile:
...     for i in infile:
...         list.append(float(i))
...
>>> list
[13.189, 18.8, 15.156, 11.0]
If it is from a .txt file, you could use the readlines() method.
You could use a for loop to go through the list of numbers (you never know how many numbers you may be given, so you might as well let the loop handle it):
with open(file_name) as f:
    elements = f.readlines()
elements = [x.strip() for x in elements]
Then loop through the elements and add them to the list:
last_number_list = []
for last_number in elements:
    last_number_list.append(float(last_number))
print last_number_list
A slightly less compact but easy-to-read approach is:
num_list = []

f = open('file.txt', 'r')  # open in read mode 'r'
lines = f.readlines()      # read all lines in file
f.close()                  # safe to close file now

for line in lines:
    num_list.append(float(line.strip()))

print num_list
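Since you also want to add all the numbers up, here is a minimal sketch of that last step (assuming the num_list built in the snippet above):

total = sum(num_list)   # the built-in sum() adds all the floats in the list
print total             # 61.0 for the numbers shown in the question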
#!/usr/bin/env python2.7
import vobject

abinfile = '/foo/bar/dir/infile.vcf'    # ab stands for address book
aboutfile = '/foo/bar/dir/outfile.vcf'

def eliminate_vcard_duplicates(abinfile, aboutfile):
    # we first convert the Address Book IN FILE into a list
    with open(abinfile) as source_file:
        ablist = list(vobject.readComponents(source_file))

    # then add each vcard from that list to a new list unless it's already there
    ablist_norepeats = []
    ablist_norepeats.append(ablist[0])
    for i in range(1, len(ablist)):
        jay = len(ablist_norepeats)
        for j in reversed(range(0, jay)):  # we do reversed because usually cards have duplicates nearby
            if ablist_norepeats[j].serialize() == ablist[i].serialize():
                break
            else:
                jay += -1
        if jay == 0:
            ablist_norepeats.append(ablist[i])

    # and finally write the singularized list to the Address Book OUT FILE
    with open(aboutfile, 'w') as destination_file:
        for j in range(0, len(ablist_norepeats)):
            destination_file.write(ablist_norepeats[j].serialize())

eliminate_vcard_duplicates(abinfile, aboutfile)
The above code works and creates a new file where there are no exact duplicates (duplicates with identical serializations). I know the code has some efficiency issues: it's n squared when it could be n log n; we could serialize each vcard only once; inefficient use of for loops, etc. Here I wanted to provide short code to illustrate one of the issues I don't know how to solve.
The issue that I'm not sure how to solve elegantly is this one: If some of the fields in the cards are scrambled it will not detect they are equal. Is there a way to detect such duplicates either with vobject, re, or another approach?
The file contents used in the test, with four equal vcards (scrambled phone numbers trip up the code - scrambled emails do not, though), are these:
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1#foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1#foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
EMAIL;TYPE=INTERNET:foobar1#foo.bar.com
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:987654321
TEL;TYPE=CELL:123456789
EMAIL;TYPE=INTERNET:foobar1#foo.bar.com
END:VCARD
The above code will not detect that the four are all the same because the last one has the phone numbers scrambled.
As a bonus, if someone has a faster algorithm, it would be great if it could be shared. The above one takes days on a 30,000-vcard file...
One thing you might have noticed is that if you call the .serialize() method, then EMAIL is sorted before FN. But unfortunately the telephone numbers are not sorted. If they were, you could add the serialized individual components to a set, and let the unique hashes sort out the multiple occurrences.
If you investigate what you get from the generator vobject.readComponents() (e.g. using type()), you'll see that it is a Component from the module vobject.base, and using dir() on an instance you see a method getSortedChildren(). If you look that up in the source, you'll find:
def getSortedChildren(self):
    return [obj for k in self.sortChildKeys() for obj in self.contents[k]]
and sortChildKeys() directly above that:
def sortChildKeys(self):
    try:
        first = [s for s in self.behavior.sortFirst if s in self.contents]
    except Exception:
        first = []
    return first + sorted(k for k in self.contents.keys() if k not in first)
Calling sortChildKeys() on your example instances gives ['version', 'email', 'fn', 'n', 'tel'], which leads to two conclusions:
sortFirst causes version to be at the front
for obj in self.contents[k] is not sorted, therefore your TEL entries are not sorted.
The solution seems to be that you redefine getSortedChildren() to:
return [obj for k in self.sortChildKeys() for obj in sorted(self.contents[k])]
but that leads to:
TypeError: '<' not supported between instances of 'ContentLine' and 'ContentLine'
so you need to provide some basic comparison operations for ContentLine, which is also defined in vobject.base:
import vobject
from vobject.base import Component, ContentLine

def gsc(self):
    return [obj for k in self.sortChildKeys() for obj in sorted(self.contents[k])]

Component.getSortedChildren = gsc

def ltContentLine(self, other):
    return str(self) < str(other)

def eqContentLine(self, other):
    return str(self) == str(other)

ContentLine.__lt__ = ltContentLine
ContentLine.__eq__ = eqContentLine

addresses = set()

with open('infile.vcf') as fp:
    for vcard in vobject.readComponents(fp):
        # print(type(vcard))
        # print(dir(vcard))
        # print(vcard.sortChildKeys())
        # print(vcard.contents.keys())
        addresses.add(vcard.serialize())

with open('outfile.vcf', 'w') as fp:
    for a in addresses:
        fp.write(a)

# and check
with open('outfile.vcf') as fp:
    print(fp.read(), end="")
which gives:
BEGIN:VCARD
VERSION:3.0
EMAIL;TYPE=INTERNET:foobar1#foo.bar.com
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
The following code is faster (about three orders of magnitude) but still only removes exact duplicates...
#!/usr/bin/env python2.7
import vobject
import datetime

abinfile = '/foo/bar/dir/infile.vcf'    # ab stands for address book
aboutfile = '/foo/bar/dir/outfile.vcf'

def eliminate_vcard_duplicatesv2(abinfile, aboutfile):
    # we first convert the Address Book IN FILE into a list
    ablist = []
    with open(abinfile) as source_file:
        ablist = list(vobject.readComponents(source_file))

    # we then serialize the list to expedite the comparison process
    ablist_serial = []
    for i in range(0, len(ablist)):
        ablist_serial.append(ablist[i].serialize())

    # then add each unique vcard's position from that list to a new list unless it's already there
    ablist_singletons = [0]  # the first card is always kept
    duplicates = 0
    for i in range(1, len(ablist_serial)):
        if i % 1000 == 0:
            print "COMPUTED CARD:", i, "Number of duplicates: ", duplicates, "Current time:", datetime.datetime.now().time()
        jay = len(ablist_singletons)
        for j in reversed(range(0, jay)):  # we do reversed because usually cards have duplicates nearby
            if ablist_serial[ablist_singletons[j]] == ablist_serial[i]:
                duplicates += 1
                break
            else:
                jay += -1
        if jay == 0:
            ablist_singletons.append(i)

    print "Length of Original Vcard File: ", len(ablist)
    print "Length of Singleton Vcard File: ", len(ablist_singletons)
    print "Generating Singleton Vcard file and storing it in: ", aboutfile

    # and finally write the singularized list to the Address Book OUT FILE
    with open(aboutfile, 'w') as destination_file:
        for k in range(0, len(ablist_singletons)):
            destination_file.write(ablist_serial[ablist_singletons[k]])

eliminate_vcard_duplicatesv2(abinfile, aboutfile)
A variation on Anthon's answer, using class decorators.
import vobject
from vobject.base import Component, ContentLine

def sortedContents(cls):
    def getSortedChildren(self):
        return [obj for k in self.sortChildKeys() for obj in sorted(self.contents[k])]
    cls.getSortedChildren = getSortedChildren
    return cls

def sortableContent(cls):
    def __lt__(self, other):
        return str(self) < str(other)
    def __eq__(self, other):
        return str(self) == str(other)
    cls.__lt__ = __lt__
    cls.__eq__ = __eq__
    return cls

Component = sortedContents(Component)
ContentLine = sortableContent(ContentLine)

addresses = set()

with open('infile.vcf') as infile:
    for vcard in vobject.readComponents(infile):
        addresses.add(vcard.serialize())

with open('outfile.vcf', 'wb') as outfile:
    for address in addresses:
        outfile.write(bytes(address, 'UTF-8'))
I want to get not only the result of RegexpParser, but also the index of the result.
For example the start index of the word and the end index of the word.
import nltk
from nltk import word_tokenize, pos_tag
text = word_tokenize("6 ACCESSKEY attribute can be used to specify many 6.0 shortcut key 6.0")
tag = pos_tag(text)
print tag
# grammar = "NP: {<DT>?<JJ>*<NN|NNS|NNP|NNPS>}"
grammar2 = """Triple: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*<MD>*<VB.*>+<JJ>?<RB>?<CD>*<DT>?<NN.*>*<IN*|TO*>?<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
Triple: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*<MD>*<VB.*>+<JJ>?<RB>?<CD>*<DT>?<NN.*>*<TO>?<VB><DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
"""
grammar = """
NP: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
VP: {<VB.*>+<JJ>*<RB>*<JJ>*<VB.*>?<DT>?<NN|NP>?<IN*|TO*>?}
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()
Since you give the parser tokenised text, there is no way it can guess the original offsets (how could it know how much space was between the tokens).
But, fortunately, the parse() method accepts additional info, which is simply passed on to the output.
In your example, the input (you saved it in the badly named variable tag) looks like this:
[('6', 'CD'),
 ('ACCESSKEY', 'NNP'),
 ('attribute', 'NN'),
 ...
If you manage to change it to
[('6', 'CD', 0, 1),
 ('ACCESSKEY', 'NNP', 2, 11),
 ('attribute', 'NN', 12, 21),
 ...
and feed this to the parser, then the offsets will be included in the parse tree:
Tree('S',
     [Tree('NP', [('6', 'CD', 0, 1),
                  ('ACCESSKEY', 'NNP', 2, 11),
                  ('attribute', 'NN', 12, 21)]),
      ...
How do you get the offsets into the tagged sequence?
Well, I will leave this as a programming exercise to you.
Hint: Look for the span_tokenize() method of the word tokenisers.
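If you want a starting point, here is a minimal sketch (one possible approach, not the only one) that uses WhitespaceTokenizer.span_tokenize(), which is good enough here because the example text is purely space-separated; for general text you would pick a tokenizer whose span_tokenize() output matches the tokens you tag:

import nltk
from nltk.tokenize import WhitespaceTokenizer

text = "6 ACCESSKEY attribute can be used to specify many 6.0 shortcut key 6.0"

spans = list(WhitespaceTokenizer().span_tokenize(text))   # [(0, 1), (2, 11), ...]
tokens = [text[start:end] for start, end in spans]
tagged = nltk.pos_tag(tokens)

# extend each (token, tag) pair with its (start, end) character offsets
tagged_with_offsets = [(tok, tag, start, end)
                       for (tok, tag), (start, end) in zip(tagged, spans)]

cp = nltk.RegexpParser("NP: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}")
print(cp.parse(tagged_with_offsets))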
I'm trying to loop through a number of text documents and create a feature set by recording:
position list in text
Part of speech of keyphrase
Length of each keyphrase (number of words in it)
Frequency of each keyphrase
Code snippet for extracting features:
# Take list of keywords
keyword_list = [line.split(':')[1].lower().strip() for line in keywords.splitlines() if ':' in line]

# Position
position_list = [[m.start()/float(len(document)) for m in re.finditer(re.escape(kw), document, flags=re.IGNORECASE)] for kw in keyword_list]

# Part of speech
pos_list = []
for key in keyword_list:
    pos_list.append([pos for w, pos in nltk.pos_tag(nltk.word_tokenize(key))])

# Length of each keyword
len_list = [len(k.split(' ')) for k in keyword_list]

# Text frequency
freq_list = [len(pos)/float(len(document)) for pos in position_list]

target.extend(keyword_list)

for i in range(0, len(keyword_list)):
    data.append([position_list[i], pos_list[i], len_list[i], freq_list[i]])
Where
target : list of keywords
data : list of features
I passed this through a classifier:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.25, random_state=42)

import numpy as np
X_train = np.array(X_train)
y_train = np.array(y_train)

from sklearn import svm
cls = svm.SVC(gamma=0.001, C=100)  # Parameter values matter!
cls.fit(X_train, y_train)

predictions = cls.predict(X_test)
But I get an error:
Traceback (most recent call last):
  File "supervised_3.py", line 113, in <module>
    cls.fit(X_train,y_train)
  File "/Library/Python/2.7/site-packages/sklearn/svm/base.py", line 150, in fit
    X = check_array(X, accept_sparse='csr', dtype=np.float64, order='C')
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence
So, I removed all the list items by changing
data.append([position_list[i],pos_list[i],len_list[i],freq_list[i]])
to
data.append([len_list[i],freq_list[i]])
It worked.
But I need to include position_list and pos_list.
I thought it wasn't working because these two are lists. So I tried converting them to arrays:
data.append([np.array(position_list[i]),np.array(pos_list[i]),len_list[i],freq_list[i]])
but I still get the same error.
In the last for loop of the feature extraction code you are trying to append to data a list of four elements, namely position_list[i], pos_list[i], len_list[i], freq_list[i]. The problem is that the first two elements are lists themselves, but individual features have to be scalars (this is why the issue is not solved by converting the sublists to numpy arrays). Each of them requires a different workaround:
position_list[i]
This is a list of float numbers. You could replace this list by some statistics computed from it, for example the mean and the standard deviation.
pos_list[i]
This is a list of tags extracted from the list of tuples of the form (token, tag)* yielded by nltk.pos_tag. The tags (which are strings) can be converted into numbers in a straightforward way by counting their number of occurrences. To keep things simple, I will just add the frequency of 'NN' and 'NNS' tags**.
To get your code working you just need to change the last for loop to:
for i in range(0, len(keyword_list)):
    positions_i = position_list[i]
    tags_i = pos_list[i]
    len_tags_i = float(len(tags_i))
    m = np.mean(positions_i)
    s = np.std(positions_i)
    nn = tags_i.count('NN')/len_tags_i
    nns = tags_i.count('NNS')/len_tags_i
    data.append([m, s, nn, nns, len_list[i], freq_list[i]])
By doing so the resulting feature vector becomes 6-dimensional. Needless to say, you could use a higher or lower number of statistics and/or tag frequencies, or even a different tagset.
* The identifiers w,pos you use in the for loop that creates pos_list are a bit misleading.
** You could utilize a collections.Counter to count the number of occurrences of each tag more efficiently.
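For example, here is a minimal sketch of that Counter variant (reusing the lists built earlier; the NN/NNS/JJ tagset is only an illustrative choice):

from collections import Counter

selected_tags = ['NN', 'NNS', 'JJ']   # illustrative tagset; pick whatever fits your data

for i in range(0, len(keyword_list)):
    tag_counts = Counter(pos_list[i])          # counts every tag in a single pass
    len_tags_i = float(len(pos_list[i]))
    tag_freqs = [tag_counts[t]/len_tags_i for t in selected_tags]
    data.append([np.mean(position_list[i]), np.std(position_list[i])]
                + tag_freqs
                + [len_list[i], freq_list[i]])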