I want the chunker to consider each single line to make chunks using the grammar provided and for that I am considering "." to exit but it doesn't seem to work.
I tried <\\.$>
and <[.]$>
But it seem to have no effect while chunking.
def pos_tag(sentence):
tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
return zip(sentence, tags)
a = list(pos_tag(word_tokenize("He worked quite well. I find him good.")))
grammar = r""" ZZZZZZ:
{<PRP.>+<VB.>+<[.]$|.*>*<RB.>*}
{<PRP.|NN.>+<RB.>+<VB.>?}
{<JJ.>+<[.]$|.*>*<NN.|PRP.>+}
{<NN.|PRP.>+<[.]$|.*>*<JJ.>+}
{<RB.>*<[.]$|.*>*<NN.>*}
{<DT|PRP.>?<JJ.>+<NN>+}
"""
cp = nltk.RegexpParser(grammar)
chunked = cp.parse(a)
z = []
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'ZZZZZZ'):
z.append(subtree)
After I print z I am getting output.
I am getting the entire input string as a chunk without it being split at ".".
The second last line basically is an adjective and any number of words till a noun is found or a full stop is encountered.
The <.> is making the entire input as single valid chunk where as it should have stopped at the next full stop.
Related
Background
I have a list of "bad words" in a file called bad_words.conf, which reads as follows
(I've changed it so that it's clean for the sake of this post but in real-life they are expletives);
wrote (some )?rubbish
swore
I have a user input field which is cleaned and striped of dangerous characters before being passed as data to the following script, score.py
(for the sake of this example I've just typed in the value for data)
import re
data = 'I wrote some rubbish and swore too'
# Get list of bad words
bad_words = open("bad_words.conf", 'r')
lines = bad_words.read().split('\n')
combine = "(" + ")|(".join(lines) + ")"
#set score incase no results
score = 0
#search for bad words
if re.search(combine, data):
#add one for a hit
score += 1
#show me the score
print(str(score))
bad_words.close()
Now this finds a result and adds a score of 1, as expected, without a loop.
Question
I need to adapt this script so that I can add 1 to the score every time a line of "bad_words.conf" is found within text.
So in the instance above, data = 'I wrote some rubbish and swore too' I would like to actually score a total of 2.
1 for "wrote some rubbish" and +1 for "swore".
Thanks for the help!
Changing combine to just:
combine = "|".join(lines)
And using re.findall():
In [33]: re.findall(combine,data)
Out[33]: ['rubbish', 'swore']
The problem with having the multiple capturing groups as you originally were doing is that re.findall() will return each additional one of those as an empty string when one of the words is matched.
I am newer to scripting, so my code may be a bit mangled, I apologize in advance.
I am trying to iterate through a CSV and write it to an excel workbook using openpyxl. But before I write it, I am performing a few checks to determine which sheet to write the row to.
The row has content such as:"KB4462941", "kb/9191919", "kb -919", "sdfklKB91919".
I am trying to pull the first numbers following "KB" then stop reading in characters once a non-numeric character is found. One I find it, then I run a separate function that queries a DB. That function works.
The problem I am running into, is once it finds the first KB: KB4462941, it gets hung up and goes over that KB multiple times until the last time it appears in that row, then the program finishes.
Unfortunately, there is not default location for where the KB characters will be in the row, and there is no default character count between the KB and the first numbers.
My code:
with open('test.csv') as file:
reader = csv.reader(file, delimiter = ',')
for row in reader:
if str(row).find("SSL") != -1:
ws = book.get_sheet_by_name('SSL')
ws.append(row)
else:
mylist = list(row)
string = ''.join(mylist)
tmplist = list()
resultlist = list()
pattern = 'KB.*[0-9]*'
for i in mylist:
tmplist += re.findall(pattern, i, re.IGNORECASE)
for i in tmplist:
resultlist += re.findall('[0-9]*', i)
for i in resultlist:
if len(i) > 4:
print i
if dbFunction(i) == 1:
ws = book.get_sheet_by_name('Found')
ws.append(row)
else:
ws = book.get_sheet_by_name('Nothing')
ws.append(row)
output:
1st row is skipped
2nd row is in the right place
3rd and 4th row in the right place
5th row is written for the next nine 9 rows.
never gets to the following 3 rows.
I am trying to run the following code:
fname = raw_input ('Enter file name:')
fh = open (fname)
count = 0
for line in fh:
if not line.startswith ('X-DSPAM-Confidence:') : continue
else:
count = count + 1
new = fh #this new = fh is supposed to be fh stripped of the non- x-dspam lines
for line in new: # this seperates the lines in new and allows `finding the floats on each line`
numpos = new.find ('0')
endpos = new.find ('5', numpos)
num = new[numpos:endpos + 1]
float (num)
# should now have a list of floats
print num
The intention of this code is to prompt the user for a file name, open the file, read through the file, compile all the lines that start with X-DSPAM, and extract the float number on these lines. I am fairly new to coding so I realise I may have committed a number of errors, but currently when I try to run it, after putting in the file name I get the return:
I looked around and I have seen that mode 'r' refers to different file modes in python in relation to how the end of the line is handled. However the code I am trying to run is similar to other code I have formulated and it does not have any non-text files inside, the file being opened is a .txt file. Is it something to do with converting a list of strings line by line to a list of float numbers?
Any ideas on what I am doing wrong would be appreciated.
The default mode of handling a file is 'r' - which means 'read', which is what you want. It means the program is going to read the file (as opposed to 'w' - write, or 'a' - append, for example - which would allow you to overwrite the file or append to it, which you don't want in this case).
There are some bugs in your code, which I've tried to indicate in the edited code below.
You don't need to assign new = fh - you're not grabbing lines and passing them to a new file. Rather, you're checking each line against the 'XDSPAM' criteria and if it's a match, you can proceed to parse out the desired numbers. If not, you ignore it and go to the next line.
With that in mind, you can move all of the code from the for line in new to be part of the original if not ... else block.
How you find the end of the number is also a bit off. You set endpos by searching for an occurence of the number 5 - but what I think you want is to find a position 5 characters from the start position (numpos + 5).
(There are other ways to parse the line and pull the number, but I'm going to stick with your logic as indicated by your code, so nothing fancy here.)
You can convert to float in the same statement where you slice the number from the line (as below). It's acceptable to do:
num = line[numpos:endpos+1]
float_num = float(num)
but not necessary. In any event, you want to assign the conversion (float(num)) to a variable - just having float(num) doesn't allow you to pass the converted value to another statement (including print).
You say that you should have 'a list of floats' - the code as corrected below - will give you a display of all the floats, but if you want an actual Python list, there are other steps involved. I don't think you wanted a Python list, but just in case:
numlist = [] # at the beginning, declare a new, empty list
...
# after converting to float, append number to list
XDSPAM.append(num)
print XDSPAMs # at end of program, to print full list
In any event, this edited code works for me with an appropriate file of test data, and outputs the desired float numbers:
fname = raw_input ('Enter file name:')
fh = open (fname)
count = 0
for line in fh:
if not line.startswith ('X-DSPAM-Confidence:') : continue
else:
# there's no need to create the 'new' variable
# any lines that meet the criteria can be processed for numbers
count = count + 1
numpos = line.find ('0')
# i think what you want here is to set an endpoint 5 positions to the right
# but your code was looking for the position of a '5' in the line
endpos = numpos + 5
# you can convert to float and slice in the same statement
num = float(line[numpos:endpos+1])
print num
I've looked at the other questions posted on the site about index error, but I'm still not understanding how to fix my own code. Im a beginner when it comes to Python. Based on the users input, I want to check if that input lies in the fourth position of each line in the list of lists.
Here's the code:
#create a list of lists from the missionPlan.txt
from __future__ import with_statement
listoflists = []
with open("missionPlan.txt", "r") as f:
results = [elem for elem in f.read().split('\n') if elem]
for result in results:
listoflists.append(result.split())
#print(listoflists)
#print(listoflists[2][3])
choice = int(input('Which command would you like to alter: '))
i = 0
for rows in listoflists:
while i < len(listoflists):
if listoflists[i][3]==choice:
print (listoflists[i][0])
i += 1
This is the error I keep getting:
not getting inside the if statement
So, I think this is what you're trying to do - find any line in your "missionPlan.txt" where the 4th word (after splitting on whitespace) matches the number that was input, and print the first word of such lines.
If that is indeed accurate, then perhaps something along this line would be a better approach.
choice = int(input('Which command would you like to alter: '))
allrecords = []
with open("missionPlan.txt", "r") as f:
for line in f:
words = line.split()
allrecords.append(words)
try:
if len(words) > 3 and int(words[3]) == choice:
print words[0]
except ValueError:
pass
Also, if, as your tags suggest, you are using Python 3.x, I'm fairly certain the from __future__ import with_statement isn't particularly necessary...
EDIT: added a couple lines based on comments below. Now in addition to examining every line as it's read, and printing the first field from every line that has a fourth field matching the input, it gathers each line into the allrecords list, split into separate words as a list - corresponding to the original questions listoflists. This will enable further processing on the file later on in the code. Also fixed one glaring mistake - need to split line into words, not f...
Also, to answer your "I cant seem to get inside that if statement" observation - that's because you're comparing a string (listoflists[i][3]) with an integer (choice). The code above addresses both that comparison mismatch and the check for there actually being enough words in a line to do the comparison meaningfully...
I'm working on a simple Python game where the computer tries to guess a number you think of. Every time it guesses the right answer, it saves the answer to a txt file. When the program is run again, it will guess the old answers first (if they're in the range the user specifies).
try:
f = open("OldGuesses.txt", "a")
r = open("OldGuesses.txt", "r")
except IOError as e:
f = open("OldGuesses.txt", "w")
r = open("OldGuesses.txt", "r")
data = r.read()
number5 = random.choice(data)
print number5
When I run that to pull the old answers, it grabs one item. Like say I have the numbers 200, 1242, and 1343, along with spaces to tell them apart, it will either pick a space, or a single digit. Any idea how to grab the full number (like 200) and/ or avoid picking spaces?
The r.read() call reads the entire contents of r and returns it as a single string. What you can do is use a list comprehension in combination with r.readlines(), like this:
data = [int(x) for x in r.readlines()]
which breaks up the file into lines and converts each line to an integer.