Scoring multiple TRUES in Python RE Search - regex

Background
I have a list of "bad words" in a file called bad_words.conf, which reads as follows
(I've changed it so that it's clean for the sake of this post, but in real life they are expletives):
wrote (some )?rubbish
swore
I have a user input field which is cleaned and stripped of dangerous characters before being passed as data to the following script, score.py
(for the sake of this example I've just typed in the value for data)
import re
data = 'I wrote some rubbish and swore too'
# Get list of bad words
bad_words = open("bad_words.conf", 'r')
lines = bad_words.read().split('\n')
combine = "(" + ")|(".join(lines) + ")"
# set score in case no results
score = 0
# search for bad words
if re.search(combine, data):
    # add one for a hit
    score += 1
# show me the score
print(str(score))
bad_words.close()
Now this finds a result and adds a score of 1, as expected, without a loop.
Question
I need to adapt this script so that I can add 1 to the score every time a line of "bad_words.conf" is found within text.
So in the instance above, data = 'I wrote some rubbish and swore too' I would like to actually score a total of 2.
1 for "wrote some rubbish" and +1 for "swore".
Thanks for the help!

Changing combine to just:
combine = "|".join(lines)
And using re.findall():
In [33]: re.findall(combine,data)
Out[33]: ['rubbish', 'swore']
The problem with the multiple capturing groups you originally used is that re.findall() then returns a tuple per match, with an empty string for every group that did not take part in that match.
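As a hedged sketch of an alternative: when a pattern contains its own capturing group, such as "(some )?" in the sample file, re.findall() returns the group contents rather than the whole match, so counting matches can be more robust than inspecting the returned strings. The helper name score_text is hypothetical, and the patterns are inlined here instead of being read from bad_words.conf.

```python
import re

def score_text(patterns, text):
    """Return how many of the patterns match somewhere in the text."""
    return sum(1 for pattern in patterns if re.search(pattern, text))

# Patterns as they would be read from bad_words.conf (inlined for this sketch)
patterns = ["wrote (some )?rubbish", "swore"]
data = "I wrote some rubbish and swore too"

# One point per pattern that is found at least once
score = score_text(patterns, data)

# Alternative: one point per occurrence, using match objects instead of findall()
occurrences = sum(1 for _ in re.finditer("|".join(patterns), data))

print(score, occurrences)
```

The first count adds one per line of the file that matches; the second adds one per occurrence, so a repeated word would score more than once.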

Related

Exiting from a chunk when "." is encountered

I want the chunker to consider each single line to make chunks using the grammar provided and for that I am considering "." to exit but it doesn't seem to work.
I tried <\\.$>
and <[.]$>
But neither seems to have any effect while chunking.
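import nltk
from nltk.tokenize import word_tokenize
def pos_tag(sentence):
    # clf and features() are defined elsewhere (not shown)
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return zip(sentence, tags)
a = list(pos_tag(word_tokenize("He worked quite well. I find him good.")))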
grammar = r""" ZZZZZZ:
{<PRP.>+<VB.>+<[.]$|.*>*<RB.>*}
{<PRP.|NN.>+<RB.>+<VB.>?}
{<JJ.>+<[.]$|.*>*<NN.|PRP.>+}
{<NN.|PRP.>+<[.]$|.*>*<JJ.>+}
{<RB.>*<[.]$|.*>*<NN.>*}
{<DT|PRP.>?<JJ.>+<NN>+}
"""
cp = nltk.RegexpParser(grammar)
chunked = cp.parse(a)
z = []
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'ZZZZZZ'):
    z.append(subtree)
When I print z, I get the entire input string as a single chunk, without it being split at ".".
The second-to-last line is basically an adjective followed by any number of words until a noun is found or a full stop is encountered.
The <.> is making the entire input a single valid chunk, whereas it should have stopped at the next full stop.
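One possible workaround, sketched here under the assumption that the tokenizer emits each full stop as its own "." token: split the tagged tokens into sentences first and parse each piece separately, so no chunk can span a full stop. The ".*" alternative in the grammar likely matches any tag, including the one for ".", which would explain why the pattern never stops. The helper name split_at_full_stops and the tags in the sample data are hypothetical.

```python
def split_at_full_stops(tagged_tokens):
    """Split a list of (word, tag) pairs into sentences, ending each at a "." token."""
    sentences, current = [], []
    for word, tag in tagged_tokens:
        current.append((word, tag))
        if word == ".":
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final full stop
        sentences.append(current)
    return sentences

# Hypothetical tagger output for "He worked quite well. I find him good."
tagged = [("He", "PRP"), ("worked", "VBD"), ("quite", "RB"), ("well", "RB"), (".", "."),
          ("I", "PRP"), ("find", "VBP"), ("him", "PRP"), ("good", "JJ"), (".", ".")]

sentences = split_at_full_stops(tagged)
print(len(sentences))
```

Each element of sentences could then be passed to cp.parse() on its own.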

Python .splitlines() to segment text into separate variables

I've read the other threads on this site but haven't quite grasped how to accomplish what I want to do. I'd like to find a method like .splitlines() to assign the first two lines of text in a multiline string into two separate variables. Then group the rest of the text in the string together in another variable.
The purpose is to have consistent data-sets to write to a .csv using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing you can use islice(), which yields the requested items lazily instead of building a copy of the slice; this can matter for larger lists.
Code would look like this:
from itertools import islice
with open('input.txt') as f:
    data = f.readlines()
first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))
first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, keep in mind that the data source you read from must be long enough. If it is not, islice() and normal list slicing silently return empty results, while direct indexing such as data[1] raises an IndexError.
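A quick way to check how short input behaves is to try it on a one-line list. This is only an illustrative sketch with made-up data:

```python
from itertools import islice

data = ["only one line\n"]  # shorter than the three parts we want

# Slicing past the end gives empty lists rather than an error
second_via_islice = list(islice(data, 1, 2))
rest_via_slice = data[2:]

# Direct indexing past the end is what raises IndexError
try:
    data[1]
    raised = False
except IndexError:
    raised = True

print(second_via_islice, rest_via_slice, raised)
```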
Using regex
In the comments below, the OP additionally asked for an approach that avoids lists.
Reading data from a file gives you either one string (which string handling later turns into lists) or directly a list of lines, so I suggested using a regex instead.
I cannot say how regex performance compares with list/string handling. However, this should do the job:
import re
# first line, second line, then everything that follows
regex = r'(?P<first>[^\n]+)\n(?P<second>[^\n]+)\n+(?P<rest>.+)'
# DOTALL lets the final ".+" span the remaining lines
preg = re.compile(regex, re.DOTALL)
with open('input.txt') as f:
    data = f.read()
match = preg.search(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2) # This only separates the first two lines
title = lines[0]
description = lines[1]
rest = lines[2]
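To tie this back to the .csv goal, here is a minimal sketch that splits the text into three parts and writes them as one row. The helper name split_record and the sample text are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

def split_record(text):
    """Split text into (title, description, rest); missing parts become ""."""
    parts = text.split("\n", 2)
    parts += [""] * (3 - len(parts))  # pad when the text has fewer than three lines
    return tuple(parts)

sample = "Title of a string\nDescription of the string\nline 3\nline 4"
title, description, rest = split_record(sample)

buffer = io.StringIO()  # stands in for an open .csv file
csv.writer(buffer).writerow([title, description, rest])
print(buffer.getvalue())
```

Padding the list keeps the three-way unpacking from failing on short input, and the csv module quotes the multi-line third column so it stays in one cell.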

IndexError: list index out of range for list of lists in for loop

I've looked at the other questions posted on the site about index errors, but I'm still not understanding how to fix my own code. I'm a beginner when it comes to Python. Based on the user's input, I want to check if that input lies in the fourth position of each line in the list of lists.
Here's the code:
# create a list of lists from missionPlan.txt
from __future__ import with_statement
listoflists = []
with open("missionPlan.txt", "r") as f:
    results = [elem for elem in f.read().split('\n') if elem]
    for result in results:
        listoflists.append(result.split())
#print(listoflists)
#print(listoflists[2][3])
choice = int(input('Which command would you like to alter: '))
i = 0
for rows in listoflists:
    while i < len(listoflists):
        if listoflists[i][3]==choice:
            print (listoflists[i][0])
        i += 1
This is the error I keep getting:
not getting inside the if statement
So, I think this is what you're trying to do - find any line in your "missionPlan.txt" where the 4th word (after splitting on whitespace) matches the number that was input, and print the first word of such lines.
If that is indeed accurate, then perhaps something along these lines would be a better approach.
choice = int(input('Which command would you like to alter: '))
allrecords = []
with open("missionPlan.txt", "r") as f:
    for line in f:
        words = line.split()
        allrecords.append(words)
        try:
            # compare as integers, and only when a fourth word exists
            if len(words) > 3 and int(words[3]) == choice:
                print(words[0])
        except ValueError:
            pass
Also, if, as your tags suggest, you are using Python 3.x, I'm fairly certain the from __future__ import with_statement isn't particularly necessary...
EDIT: added a couple of lines based on comments below. Now, in addition to examining every line as it's read and printing the first field from every line whose fourth field matches the input, it gathers each line into the allrecords list, split into separate words as a list, corresponding to the original question's listoflists. This will enable further processing on the file later on in the code. Also fixed one glaring mistake - need to split line into words, not f...
Also, to answer your "I cant seem to get inside that if statement" observation - that's because you're comparing a string (listoflists[i][3]) with an integer (choice). The code above addresses both that comparison mismatch and the check for there actually being enough words in a line to do the comparison meaningfully...
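The comparison mismatch can be sketched in isolation with made-up rows, without reading missionPlan.txt. The function name first_fields_matching and the sample data are hypothetical.

```python
def first_fields_matching(rows, choice, column=3):
    """Return the first field of every row whose `column` entry equals choice."""
    matches = []
    for row in rows:
        if len(row) <= column:
            continue  # row too short to have a fourth field
        try:
            if int(row[column]) == choice:
                matches.append(row[0])
        except ValueError:
            continue  # fourth field is not a number
    return matches

rows = [["takeoff", "a", "b", "1"], ["short", "row"],
        ["land", "c", "d", "2"], ["hover", "e", "f", "x"]]

string_equals_int = ("1" == 1)  # a str never equals an int, so the original if never fires
found = first_fields_matching(rows, 1)
print(string_equals_int, found)
```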

Learn Python the Hard Way Ex.41 Confused About For loop

I am having trouble understanding how one of the for loops works in Learn Python the Hard Way ex.41. http://learnpythonthehardway.org/book/ex41.html Below is the code from the lesson.
The loop that I am confused about is for i in range(0, snippet.count("###")):
Is it iterating over a range of 0 to snippet (of which there are 6 snippets), and adding the extra value of the count of "###"? So for the next line of code, param_count = random.randint(1,3), the extra value of "###" is applied? Or am I way off!?
Cheers
Darren
import random
from urllib import urlopen
import sys
WORD_URL = "http://learncodethehardway.org/words.txt"
WORDS = []
PHRASES = {
    "class %%%(%%%):":
      "Make a class named %%% that is-a %%%.",
    "class %%%(object):\n\tdef __init__(self, ***)" :
      "class %%% has-a __init__ that takes self and *** parameters.",
    "class %%%(object):\n\tdef ***(self, ###)":
      "class %%% has-a function named *** that takes self and ### parameters.",
    "*** = %%%()":
      "Set *** to an instance of class %%%.",
    "***.***(###)":
      "From *** get the *** function, and call it with parameters self, ###.",
    "***.*** = '***'":
      "From *** get the *** attribute and set it to '***'."
}
# do they want to drill phrases first
PHRASE_FIRST = False
if len(sys.argv) == 2 and sys.argv[1] == "english":
    PHRASE_FIRST = True
# load up the words from the website
for word in urlopen(WORD_URL).readlines():
    WORDS.append(word.strip())
def convert(snippet, phrase):
    class_names = [w.capitalize() for w in
                   random.sample(WORDS, snippet.count("%%%"))]
    other_names = random.sample(WORDS, snippet.count("***"))
    results = []
    param_names = []
    for i in range(0, snippet.count("###")):
        param_count = random.randint(1,3)
        param_names.append(', '.join(random.sample(WORDS, param_count)))
    for sentence in snippet, phrase:
        result = sentence[:]
        # fake class names
        for word in class_names:
            result = result.replace("%%%", word, 1)
        # fake other names
        for word in other_names:
            result = result.replace("***", word, 1)
        # fake parameter lists
        for word in param_names:
            result = result.replace("###", word, 1)
        results.append(result)
    return results
# keep going until they hit CTRL-D
try:
    while True:
        snippets = PHRASES.keys()
        random.shuffle(snippets)
        for snippet in snippets:
            phrase = PHRASES[snippet]
            question, answer = convert(snippet, phrase)
            if PHRASE_FIRST:
                question, answer = answer, question
            print question
            raw_input("> ")
            print "ANSWER: %s\n\n" % answer
except EOFError:
    print "\nBye"
snippet.count("###") returns the number of times "###" appears in snippet.
If "###" appears 6 times, then the for-loop runs 6 times, with i taking the values 0 through 5.
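To make that concrete, here is a small sketch (written for Python 3, unlike the book's code) using one of the snippets from PHRASES, which contains a single "###":

```python
snippet = "class %%%(object):\n\tdef ***(self, ###)"

count = snippet.count("###")     # how many times "###" occurs in the string
indices = list(range(0, count))  # the values i takes in the for-loop

print(count, indices)
```

So the loop body, which builds one parameter list, runs once for each "###" placeholder and not at all for snippets without one.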
The "try except" block runs the program until the user hits ^D.
The "while True" loop inside "try" stores the list of keys from the PHRASES dictionary in snippets. The order of the keys is different each time (because of the shuffle method). The "for" loop inside that "while" loop goes through each snippet and calls the convert method on the key and value of that snippet.
All the convert method does is replace %%%, ***, and ### in that key and value with random words from the URL's word list, and return a list (results) consisting of two strings: one made from the key and one made from the value.
Then the program prints one of the strings as a question, then gets user input (using raw_input("> ")), but no matter what the user enters, it prints the other returned string as the answer.
Inside the convert method, we have three different lists: class_names, other_names, and param_names.
To make class_names, the program counts the number of %%% inside that key (or value; both contain the same number of %%% anyway). class_names will be a random list of words whose size is the count of %%%.
other_names is again a random list of words. How many words? As many as the number of *** found in the key (or value; it does not matter which, because the count is the same within any pair).
param_names is a list of strings whose size is the number of ### found. Each string consists of one, two, or three different words separated by ", ".
'result' is a string. The program goes over the three lists (class_names, other_names, and param_names) and replaces the matching placeholder in the result string with the words it has prepared. It then appends the result to the results list. The (for sentence in snippet, phrase:) loop runs twice because 'snippet' and 'phrase' are two different strings, so the 'result' string is built twice (once for the question and once for the answer).
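The replacement step can be sketched on its own with fixed words instead of random ones. This is a Python 3 illustration; the helper name fill and the word list are hypothetical.

```python
def fill(template, marker, words):
    """Replace successive occurrences of marker with successive words."""
    result = template
    for word in words:
        # count=1 replaces only the first remaining marker
        result = result.replace(marker, word, 1)
    return result

snippet = "***.*** = '***'"
filled = fill(snippet, "***", ["apple", "color", "red"])
print(filled)
```

Because each call replaces only one placeholder, every word in the list ends up in a different slot, which is why the lists are sized by the placeholder counts.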
I put one part of this program into a smaller program to clarify how a list of a certain size is created from the random words at the URL:
https://github.com/MahshidZ/python-recepies/blob/master/random_word_set.py
Finally, I suggest putting print statements anywhere in the code that you need to understand better. As an example, for this code I printed a number of variables to see exactly what is going on. This is a good way of debugging without a debugger (look for the boolean variable DEBUG in my code):
DEBUG = 1
if DEBUG:
    print "snippet: " , snippet
    print "phrase: ", phrase
    print "class names: ", class_names
    print "other names: " , other_names
    print "param names: ", param_names

Select random group of items from txt file

I'm working on a simple Python game where the computer tries to guess a number you think of. Every time it guesses the right answer, it saves the answer to a txt file. When the program is run again, it will guess the old answers first (if they're in the range the user specifies).
try:
    f = open("OldGuesses.txt", "a")
    r = open("OldGuesses.txt", "r")
except IOError as e:
    f = open("OldGuesses.txt", "w")
    r = open("OldGuesses.txt", "r")
data = r.read()
number5 = random.choice(data)
print number5
When I run that to pull the old answers, it grabs one item. Say I have the numbers 200, 1242, and 1343, along with spaces to tell them apart; it will pick either a space or a single digit. Any idea how to grab the full number (like 200) and/or avoid picking spaces?
The r.read() call reads the entire contents of r and returns it as a single string. What you can do is use a list comprehension in combination with r.readlines(), like this:
data = [int(x) for x in r.readlines()]
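which breaks up the file into lines and converts each line to an integer. This assumes one number per line; if the numbers are separated by spaces on a single line, split on whitespace with r.read().split() instead.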
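Putting this together with the range check mentioned in the question, a minimal sketch might look like the following. The function names and the sample file contents are hypothetical, and the saved text is inlined rather than read from OldGuesses.txt.

```python
import random

def parse_guesses(text):
    """Turn whitespace-separated numbers (spaces or newlines) into a list of ints."""
    return [int(token) for token in text.split()]

def pick_old_guess(text, low, high):
    """Pick a random saved guess within [low, high], or None if there is none."""
    candidates = [n for n in parse_guesses(text) if low <= n <= high]
    return random.choice(candidates) if candidates else None

saved = "200 1242 1343\n"  # hypothetical contents of OldGuesses.txt
guesses = parse_guesses(saved)
choice = pick_old_guess(saved, 1000, 2000)
print(guesses, choice)
```

Since random.choice() now receives a list of integers rather than a string, it returns a whole number instead of a single character.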