I want to get not only the result of RegexpParser but also the index of the result, for example the start index and end index of each matched word.
import nltk
from nltk import word_tokenize, pos_tag
text = word_tokenize("6 ACCESSKEY attribute can be used to specify many 6.0 shortcut key 6.0")
tag = pos_tag(text)
print tag
# grammar = "NP: {<DT>?<JJ>*<NN|NNS|NNP|NNPS>}"
grammar2 = """Triple: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*<MD>*<VB.*>+<JJ>?<RB>?<CD>*<DT>?<NN.*>*<IN*|TO*>?<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
Triple: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*<MD>*<VB.*>+<JJ>?<RB>?<CD>*<DT>?<NN.*>*<TO>?<VB><DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
"""
grammar = """
NP: {<CD>*<DT>?<CD>*<JJ>*<CD>*<VBD|VBG>*<CD>*<NN.*>+<CD>*}
VP: {<VB.*>+<JJ>*<RB>*<JJ>*<VB.*>?<DT>?<NN|NP>?<IN*|TO*>?}
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()
Since you give the parser tokenised text, there is no way it can guess the original offsets (how could it know how much space was between the tokens?).
But, fortunately, the parse() method accepts additional info, which is simply passed on to the output.
In your example, the input (you saved it in the badly named variable tag) looks like this:
[('6', 'CD'),
('ACCESSKEY', 'NNP'),
('attribute', 'NN'),
...
If you manage to change it to
[('6', 'CD', 0, 1),
('ACCESSKEY', 'NNP', 2, 11),
('attribute', 'NN', 12, 21),
...
and feed this to the parser, then the offsets will be included in the parse tree:
Tree('S',
[Tree('NP', [('6', 'CD', 0, 1),
('ACCESSKEY', 'NNP', 2, 11),
('attribute', 'NN', 12, 21)]),
...
How do you get the offsets into the tagged sequence?
Well, I will leave this as a programming exercise to you.
Hint: Look for the span_tokenize() method of the word tokenisers.
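If you want a head start, a rough sketch of my own (assuming WhitespaceTokenizer is good enough here, since your example has no punctuation glued to the words) could look like this:
import nltk
from nltk.tokenize import WhitespaceTokenizer

raw = "6 ACCESSKEY attribute can be used to specify many 6.0 shortcut key 6.0"
spans = list(WhitespaceTokenizer().span_tokenize(raw))   # [(0, 1), (2, 11), ...]
tokens = [raw[start:end] for start, end in spans]
tagged = nltk.pos_tag(tokens)

# Attach (start, end) to each (token, tag) pair; parse() passes the extra
# fields through to the resulting tree untouched.
tagged_with_spans = [(tok, tag, start, end)
                     for (tok, tag), (start, end) in zip(tagged, spans)]

cp = nltk.RegexpParser("NP: {<CD>*<DT>?<JJ>*<NN.*>+<CD>*}")
print(cp.parse(tagged_with_spans))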
I am trying to identify dates from a column containing text entries and output the dates to a text file. However, my code is not returning any output. I can't seem to figure out what I did wrong in my code. I'd appreciate some help on this.
My Code:
import csv
from dateutil.parser import parse
with open('file1.txt', 'r') as f_input, open('file2.txt', 'w') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    for row in csv_input:
        x = str(row[3])

        def is_date(x):
            try:
                parse(x)
                csv_output.writerow([row[0], row[1], row[2], row[3], row[4]])
                # no return value in case of success
            except ValueError:
                return False

        is_date(x)
Guessing somewhat at your input, e.g.:
1,2,3, This is me on march first of 2018 at 2:15 PM, 2015
3,4,5, She was born at 12pm on 9/11/1980, 2015
a version of what you want could be
from dateutil.parser import parse
with open("input.txt", 'r') as inFilePntr, open("output.txt", 'w') as outFilePntr:
    for line in inFilePntr:
        clmns = line.split(',')
        clmns[3] = parse(clmns[3], fuzzy_with_tokens=True)[0].strftime("%Y-%m-%d %H:%M:%S")
        outFilePntr.write(', '.join(clmns))
Note that, as you do not touch the other columns, I just leave them as text; hence, no need for csv. Also, you never did anything with the return value of parse. I use fuzzy_with_tokens=True, as my column three has the date somewhat hidden in other text. The returned datetime object is transformed into a string of my liking (see here) and inserted into column three, replacing the old value.
I recombine the strings with comma separation again and write them to output.txt, which looks like:
1, 2, 3, 2018-03-01 14:15:00, 2015
3, 4, 5, 1980-09-11 12:00:00, 2015
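As a side note (my own illustration, not part of the code above): with fuzzy_with_tokens=True, parse returns a (datetime, ignored_tokens) tuple, which is why the [0] index is needed before calling strftime:
from dateutil.parser import parse

result = parse("She was born at 12pm on 9/11/1980", fuzzy_with_tokens=True)
# result[0] is the datetime; result[1] holds the text fragments that were ignored.
print(result[0].strftime("%Y-%m-%d %H:%M:%S"))   # 1980-09-11 12:00:00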
I have a .txt file without specific separators, and to parse it I need to count character by character to know where each column starts and ends. To do so, I constructed a Python dictionary where the keys are the column names and the values are the number of characters each column takes:
headers = {'first_col': 3, 'second_col': 5, 'third_col': 2, ..., 'nth_col': n_chars}
With that in mind, I know that, for the following line in the .txt file,
ABC123-3YN0000000001203ABC123*TESTINGLINE
the first three columns are:
first_col: ABC
second_col: 123-3
third_col: YN
I want to know if there is any pandas function that helps me to parse this .txt taking into account this particular condition and (if possible) using my headers dictionary.
Using a dictionary is dangerous because the order is not guaranteed. Meaning, if you picked third_col first, you'd have thrown off your entire scheme. You can fix this by using lists. From there, you can use pd.read_fwf to read a fixed-width formatted text file.
Solution
import pandas as pd

names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
pd.read_fwf(
    'myfile.txt',
    widths=widths,
    names=names
)
first_col second_col third_col
0 ABC 123-3 YN
You can also use OrderedDict from the collections library to make sure you keep the order you want, by passing an iterator that produces tuples in the correct order.
from collections import OrderedDict
names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
header = OrderedDict(zip(names, widths))
pd.read_fwf(
    'myfile.txt',
    widths=header.values(),
    names=header.keys()
)
first_col second_col third_col
0 ABC 123-3 YN
Demonstration
from collections import OrderedDict
import pandas as pd

# Write the sample line to myfile.txt so read_fwf has something to read.
txt = """ABC123-3YN0000000001203ABC123*TESTINGLINE"""
with open('myfile.txt', 'w') as f:
    f.write(txt)

names = ['first_col', 'second_col', 'third_col']
widths = [3, 5, 2]
header = OrderedDict(zip(names, widths))
pd.read_fwf(
    'myfile.txt',
    widths=header.values(),
    names=header.keys()
)
first_col second_col third_col
0 ABC 123-3 YN
First, I want to say that I am new to programming. That said, using Python 2.7.6, I'm trying to take a text file, read it in with csv, and then create a dictionary with a key equal to the first column in the file. Here is an example of the type of file I want to use (sorry for the bad formatting; there are three columns: visitid, date, and time):
visitid cdate ctime
OMHioJh8XEeq7152 6/15/2007 06:00
OMHioJh8XEeq7152 6/14/2007 07:10
OMHioJh8XEeq7152 6/11/2007 14:21
t2v0TjgroLTI6118 4/28/2006 14:18
t2v0TjgroLTI6118 5/1/2006 04:00
Specifically, given this kind of list, I want to make a key in the dictionary equal to the value of the first column, and for the value have the remaining columns as a list. Finally, I want to append the value with another list if there are duplicates of the value in column 1 to form a list of lists, so to speak. This is what I have so far, after doing some research on here and elsewhere:
def test_results(filename):
    import csv
    with open(filename, "rU") as f:
        reader = csv.reader(f, delimiter="\t")
        result = {}
        for row in reader:
            key = row[0]
            if key in result:
                result[row[0]].append(row[1])
            else:
                result[row[0]] = key
                result[key] = row[1:]
        print result
This works, but it does not append the values to make a list of lists, and only adds to the dictionary the last row for any unique visitID.
Thanks!
You should use defaultdict:
from collections import defaultdict
import csv
def test_results(filename):
    with open(filename, "rU") as f:
        reader = csv.reader(f, delimiter="\t")
        result = defaultdict(list)
        # Skip header row
        next(reader)
        for row in reader:
            result[row[0]].append(row[1:])
        return result
defaultdict(list) will assume an empty list if the key is not present in the dictionary. Given the input provided in the question, result will contain:
{'OMHioJh8XEeq7152': [['6/15/2007', '06:00'],
['6/14/2007', '07:10'],
['6/11/2007', '14:21']],
't2v0TjgroLTI6118': [['4/28/2006', '14:18'],
['5/1/2006', '04:00']]}
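As a quick aside (a toy example of mine, not the data above): with defaultdict(list), accessing a missing key creates the empty list on the fly, which is why no if key in result check is needed:
from collections import defaultdict

d = defaultdict(list)
d['new_key'].append('first value')    # no KeyError: the empty list is created automatically
d['new_key'].append('second value')
print(d['new_key'])                   # ['first value', 'second value']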
If you want a more flexible format, you should convert your date and time strings into a datetime object using dateutil.parser.parse:
import csv
from collections import defaultdict
from dateutil import parser
def test_results(filename):
    with open(filename, "rU") as f:
        reader = csv.reader(f, delimiter="\t")
        result = defaultdict(list)
        # Skip header line
        next(reader)
        for row in reader:
            result[row[0]].append(parser.parse(' '.join(row[1:])))
        return result
Which yields:
{'OMHioJh8XEeq7152': [datetime.datetime(2007, 6, 15, 6, 0),
datetime.datetime(2007, 6, 14, 7, 10),
datetime.datetime(2007, 6, 11, 14, 21)],
't2v0TjgroLTI6118': [datetime.datetime(2006, 4, 28, 14, 18),
datetime.datetime(2006, 5, 1, 4, 0)]}
Maybe something like this:
if key in result:
    result[key].append(row[1:])
else:
    result[key] = [row[1:]]
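This keeps your explicit key check, but stores the value as a list of row slices from the start, so rows for a duplicate visitid are appended rather than overwriting the earlier value.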
I have been trying to find the frequency distribution of nouns in a given sentence. If I do this:
text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
token_text= nltk.word_tokenize(text)
tagged_sent = nltk.pos_tag(token_text)
nouns = []
for word, pos in tagged_sent:
    if pos in ['NN', "NNP", "NNS"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns
It considers "ball" and "ball." as separate words. So I went ahead and tokenized the sentence before tokenizing the words:
text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
sentences = nltk.sent_tokenize(text)
words = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sent = [nltk.pos_tag(sent) for sent in words]
nouns = []
for word, pos in tagged_sent:
    if pos in ['NN', "NNP", "NNS"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns
It gives the following error:
Traceback (most recent call last):
File "C:\beautifulsoup4-4.3.2\Trial.py", line 19, in <module>
for word,pos in tagged_sent:
ValueError: too many values to unpack
What am I doing wrong? Please help.
You were so close!
In this case, you changed your tagged_sent from a list of tuples to a list of lists of tuples because of your list comprehension tagged_sent = [nltk.pos_tag(sent) for sent in words].
Here's some things you can do to discover what type of objects you have:
>>> type(tagged_sent), len(tagged_sent)
(<type 'list'>, 2)
This shows you that you have a list; in this case of 2 sentences. You can further inspect one of those sentences like this:
>>> type(tagged_sent[0]), len(tagged_sent[0])
(<type 'list'>, 9)
You can see that the first sentence is another list, containing 9 items. What does one of those items look like? Well, let's look at the first item of the first list:
>>> tagged_sent[0][0]
('this', 'DT')
If you're curious to see the entire object, which I frequently am, you can ask the pprint (pretty-print) module to make it nicer to look at, like this:
>>> from pprint import pprint
>>> pprint(tagged_sent)
[[('this', 'DT'),
('ball', 'NN'),
('is', 'VBZ'),
('blue', 'JJ'),
(',', ','),
('small', 'JJ'),
('and', 'CC'),
('extraordinary', 'JJ'),
('.', '.')],
[('like', 'IN'), ('no', 'DT'), ('other', 'JJ'), ('ball', 'NN'), ('.', '.')]]
So, the long answer is that your code needs to iterate over the new second layer of lists, like this:
nouns = []
for sentence in tagged_sent:
    for word, pos in sentence:
        if pos in ['NN', "NNP", "NNS"]:
            nouns.append(word)
Of course, this just returns a non-unique list of items, which look like this:
>>> nouns
['ball', 'ball']
You can unique-ify this list in many different ways, but a quick way is to use the set() data structure, like so:
unique_nouns = list(set(nouns))
>>> print unique_nouns
['ball']
For an examination of other ways you can unique-ify a list of items, see the slightly older but extremely useful: http://www.peterbe.com/plog/uniqifiers-benchmark
I am creating a web crawler and, as a first step, I need to crawl a website and extract all its links; however, my code is not looping to extract them. I tried using append, but that results in a list of lists. I'm trying to use foo instead, and I get an error. Any help would be appreciated. Thank you.
from urllib2 import urlopen
import re
def get_all_urls(url):
    get_content = urlopen(url).read()
    url_list = []
    find_url = re.compile(r'a\s?href="(.*)">')
    url_list_temp = find_url.findall(get_content)
    for i in url_list_temp:
        url_temp = url_list_temp.pop()
        source = 'http://blablabla/'
        url = source + url_temp
        url_list.append(url)
    #print url_list
    return url_list

def web_crawler(seed):
    tocrawl = [seed]
    crawled = []
    i = 0
    while i < len(tocrawl):
        page = tocrawl.pop()
        if page not in crawled:
            #tocrawl.append(get_all_urls(page))
            foo = (get_all_urls(page))
            tocrawl = foo
            crawled.append(page)
        if not tocrawl:
            break
    print crawled
    return crawled
First of all, it's a bad idea to parse HTML with regular expressions, for all the reasons listed:
here: Python regular expression for HTML parsing (BeautifulSoup)
here: Python regex to match HTML
here: regexp python with parsing html page
and so on.
You should use an HTML parser to do the job. Python provides one in its standard library: HTMLParser, but you could also use BeautifulSoup or even lxml. I tend to favor BeautifulSoup, for its nice API.
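For example, a rough BeautifulSoup version of your get_all_urls (a sketch of mine, assuming bs4 is installed alongside the urllib2/urlparse modules your code already uses, not a tested drop-in replacement) could look like:
from urllib2 import urlopen
from urlparse import urljoin
from bs4 import BeautifulSoup

def get_all_urls(url):
    soup = BeautifulSoup(urlopen(url).read())
    # Take the href of every <a> tag and resolve relative links against the page URL.
    return [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]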
Now, back to your problem, you're modifying the list you're iterating on:
for i in url_list_temp:
    url_temp = url_list_temp.pop()
    source = 'http://blablabla/'
    ...
This is bad, because it metaphorically amounts to sawing off the branch you're sitting on.
Also, you do not seem to need this removal at all, as there is no condition under which a URL must be removed.
Finally, you get an error after using append because, as you said, it creates a list of lists. You should use extend instead.
>>> l1 = [1, 2, 3]
>>> l2 = [4, 5, 6]
>>> l1.append(l2)
>>> l1
[1, 2, 3, [4, 5, 6]]
>>> l1 = [1, 2, 3]
>>> l1.extend(l2)
>>> l1
[1, 2, 3, 4, 5, 6]
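Applied to your web_crawler, that means extending tocrawl with the newly found URLs, e.g. tocrawl.extend(get_all_urls(page)), instead of reassigning tocrawl = foo, so pages already queued are not thrown away.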
NB: Take a look at http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/ for additional help with scraping with beautifulsoup