Print line if any of these words are matched - python-2.7

I have a text file with 1000+ lines, each one representing a news article about a topic that I'm researching. Several hundred lines/articles in this dataset are not about the topic, however, and I need to remove these.
I've used grep to remove many of them (grep -vwE "(wordA|wordB)" test8.txt > test9.txt), but I now need to go through the rest manually.
I have working code that finds every line that does not contain a certain word, prints that line, and asks whether it should be removed. It works well, but I'd like to include several other words. E.g. let's say my research topic is meat eating trends. I hope to write a script that prints lines that contain none of 'chicken', 'pork' or 'beef', so I can manually verify whether those lines/articles are about the relevant topic.
I know I can do this with elif, but I wonder if there is a better and simpler way? E.g. I tried if "chicken" or "beef" not in line: but it did not work.
Here's the code I have:
orgfile = 'text9.txt'
newfile = 'test10.txt'
newFile = open(newfile, 'wb')
with open("test9.txt") as f:
    for num, line in enumerate(f, 1):
        if "chicken" not in line:
            print "{} {}".format(line.split(',')[0], num)
            testVar = raw_input("1 = delete, enter = skip.")
            testVar = testVar.replace('', '0')
            testVar = int(testVar)
            if testVar == 10:
                print ''
                os.linesep
            else:
                f = open(newfile,'ab')
                f.write(line)
                f.close()
        else:
            f = open(newfile,'ab')
            f.write(line)
            f.close()
Edit: I tried Pieter's answer to this question but it does not work here, presumably because I am not working with integers.

You can use any or all with a generator expression. For example:
>>> key_words = {"chicken", "beef"}
>>> test_texts = ["the price of beef is too high", "the chicken farm now open", "tomorrow there is a lunar eclipse", "bla"]
>>> for title in test_texts:
...     if any(key in title for key in key_words):
...         print title
...
the price of beef is too high
the chicken farm now open
>>>
>>> for title in test_texts:
...     if not any(key in title for key in key_words):
...         print title
...
tomorrow there is a lunar eclipse
bla
>>>
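As for why the attempt in the question failed: `"chicken" or "beef" not in line` is parsed as `"chicken" or ("beef" not in line)`, and a non-empty string is always truthy, so the condition is always true. A minimal sketch of the any-based check applied to the filtering task might look like the following. The helper name `needs_review`, the keyword tuple and the sample lines are made up for illustration, and lower-casing the line is an assumption about what counts as a match. The single-argument print call is used so the snippet should behave the same on Python 2.7 and 3.

```python
KEYWORDS = ("chicken", "pork", "beef")  # hypothetical keyword list


def needs_review(line, keywords=KEYWORDS):
    """Return True if the line contains none of the keywords."""
    lowered = line.lower()  # assumption: matching should ignore case
    return not any(word in lowered for word in keywords)


# Made-up sample lines standing in for the lines of the text file.
sample_lines = [
    "Beef prices rise again, 2014",
    "Lunar eclipse expected tomorrow, 2014",
    "New chicken farm opens, 2015",
]

# Lines that mention no keyword are the ones to check by hand.
to_check = [line for line in sample_lines if needs_review(line)]
print(to_check)
```

Note that this is a substring test, so unlike grep -w it would also match a keyword inside a longer word.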

Related

Python: How can you recursively search a .txt file, find matches and print results

I have been searching for an answer to this, but can not seem to get what I need. I would like a python script that reads my text file and starting from the top working its way through each line of the file and then prints out all the matches in another txt file. Content of the text file is just 4 digit numbers like 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file, since it has almost 10,000 lines. I appreciate any help. If someone could just point me in the right direction, that would be great. Thank you.
import re
file = open("filename", 'r')
fileContent=file.read()
pattern="1234"
print len(re.findall(pattern,fileContent))
If I were you I would open the file, use the split method to create a list of all the numbers, and use Counter from collections to count how many times each number occurs, keeping only the duplicates.
from collections import Counter
filepath = 'original_file'
new_filepath = 'new_file'
file = open(filepath,'r')
text = file.read()
file.close()
# split() without arguments also discards blank lines and stray whitespace
numbers_list = text.split()
dupes = [[item,': matches found =',str(count)] for item,count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]
new_file = open(new_filepath,'w')
for i in dupes:
    # add a newline so each result lands on its own line
    new_file.write(i + '\n')
new_file.close()
Thanks to everyone who helped me on this. Thank you to @csabinho for the code he provided and to @IanAuld for asking me "Why do you think you need recursion here?" It got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit numbers were unique. So this is what I came up with and it worked beautifully!
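import re
# Read the file once instead of reopening it for every candidate number.
with open("4digits.txt", 'r') as f:
    fileContent = f.read()
# Try every 4-digit number from 1000 to 9999.
for a in range(1000, 10000):
    result = len(re.findall(str(a), fileContent))
    if result > 1:
        print(a, "matches", result)
    elif result == 1:
        # Exactly one occurrence; numbers that never occur are skipped.
        print(a, "This number is unique!")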
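For comparison, here is a rough single-pass sketch using Counter that separates duplicated numbers from unique ones without scanning the text once per candidate. The function name `count_numbers` and the in-memory sample list are illustrative only; with a real file you could pass the open file object instead of the list.

```python
from collections import Counter


def count_numbers(lines):
    """Count each non-empty line after stripping surrounding whitespace."""
    return Counter(line.strip() for line in lines if line.strip())


# Sample data taken from the example in the question.
sample = ["1234\n", "3214\n", "4567\n", "8963\n", "1532\n", "1234\n"]
counts = count_numbers(sample)

# Numbers seen more than once, and numbers seen exactly once.
duplicates = {num: n for num, n in counts.items() if n > 1}
unique = sorted(num for num, n in counts.items() if n == 1)

for num, n in sorted(duplicates.items()):
    print("{} : matches found = {}".format(num, n))
print("unique: {}".format(unique))
```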

Converting a textfile into a list

I have a text file which contains a series of movie titles, which looks like this once opened.
A Nous la Liberte (1932) About Schmidt (2002) Absence of Malice
(1981) Adam's Rib (1949) Adaptation (2002) The Adjuster (1991) The
Adventures of Robin Hood (1938) Affliction (1998) The African Queen
(1952)
Using the code below:
def movie_text():
    moviefile = open("movies.txt", 'r')
    yourResult = [line.split('\n') for line in moviefile.readlines()]
movie_text()
I get nothing.
Your code never prints or returns anything, so you see no output.
If I understand correctly, you want something like this:
moviefile = open("movies.txt", 'r')
lines = moviefile.readlines()
moviefile.close()
print(len(lines)) # Shows list size
for line in lines:
    print(line.rstrip('\n')) # rstrip('\n') removes the trailing newline
The method readlines returns a list, so I am not sure why you use split. If all you want is to remove the '\n', you can do it in many ways; the one I used is just one of them.
Hope it works!
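If the goal is one list entry per movie rather than per line, note that the titles in the sample wrap across line breaks, so splitting on '\n' would not be enough. A possible sketch, assuming every entry ends with a four-digit year in parentheses and no title contains such a pattern itself (the function name `extract_titles` is made up):

```python
import re


def extract_titles(text):
    """Split a run of 'Title (YYYY)' entries into a list of titles."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    # Lazily match up to each four-digit year in parentheses.
    return [m.strip() for m in re.findall(r".+?\(\d{4}\)", flat)]


# A shortened copy of the sample text from the question.
raw = (
    "A Nous la Liberte (1932) About Schmidt (2002) Absence of Malice\n"
    "(1981) Adam's Rib (1949)\n"
)
titles = extract_titles(raw)
print(titles)
```

With the real file, the text would come from reading movies.txt in one go instead of the `raw` string.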

Multiple output files created but empty

I am trying to split one file with two articles in it into two separate files with one article in each, for subsequent analysis of the articles. Each article in the initial file has an ID that I want to use to separate the files with, using RE.
Below is the initial input file, with ID number:
166068619 #### "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]."
106899978 #### "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."
However, when I run my code, I do get two separate files as an output but they are empty.
This is my code:
def file_split(path_to_file):
    """Function splits bigger file into N smaller ones, based on a certain RE
    match, that is used to break the bigger file into smaller ones"""
    def pattern_extract(path_to_file):
        """Function identifies the number of RE occurrences in a file,
        No. can be used in further analysis as range No."""
        import re
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)
    y = pattern_extract(path_to_file)
    m = y + 1
    files = [open('filename%i.txt' %i, 'w') for i in range(1,m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files
Output result is as follows:
file_split(path)
Out[19]:
[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
<open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]
I am new to Python and I am not quite sure where the problem lies. I checked some other answers that addressed the multiple file outputs but cannot figure out the solution. Help would be very much appreciated.
There are two problems with your code:
you write only the line matching the ID (actually, just the match itself), not the rest
you are always writing to the last file, as you use i, the loop variable "left over" from the list comprehension
To fix it, you could change the lower portion of your code to this:
y = pattern_extract(path_to_file)
files = [open('filename%i.txt' %i, 'w') for i in range(y)]
n = -1
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        files[n].write(line)
But you do not have to read the file two times at all, just to count the matches: Just open another file when the line matches an ID line and directly write to that last file in the list, then close all the files.
open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        open_files[-1].write(line)
for f in open_files:
    f.close()
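A variation on the same idea, sketched under the assumption that the whole input fits in memory: group the lines by article first, then write each group out afterwards. The names `ID_LINE` and `split_articles` and the shortened sample lines are hypothetical, and the output file names in the trailing comment are only a suggestion.

```python
import re

# Same ID pattern as above, with the number captured as a group.
ID_LINE = re.compile(r'^(\d+)\s+####\s+')


def split_articles(lines):
    """Group lines into a list of (article_id, text) pairs."""
    articles = []
    for line in lines:
        match = ID_LINE.match(line)
        if match:
            # An ID line starts a new article.
            articles.append((match.group(1), [line]))
        elif articles:  # ignore any text before the first ID line
            articles[-1][1].append(line)
    return [(art_id, "".join(chunk)) for art_id, chunk in articles]


# Shortened, made-up stand-ins for the lines of the input file.
sample = [
    '166068619\t####\t"Epilepsy: let\'s end our ignorance\n',
    'Helen Stephens is a young woman with epilepsy [...]."\n',
    '106899978\t####\t"Great British Payoff\n',
    'If it was a television series [...]."\n',
]
articles = split_articles(sample)

# Each article could then be written with its own `with` block, e.g.:
# for art_id, text in articles:
#     with open('article_%s.txt' % art_id, 'w') as out:
#         out.write(text)
```

Using a `with` block per output file means each file is flushed and closed as soon as it is written, which avoids ending up with empty files because of unflushed buffers.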

Converting a list from a .txt file into a dictionary

Ok, I've tried all the methods in Convert a list to a dictionary in Python, but I can't seem to get this to work right. I'm trying to convert a list that I've made from a .txt file into a dictionary. So far my code is:
import os.path
from tkinter import *
from tkinter.filedialog import askopenfilename
import csv
window = Tk()
window.title("Please Choose a .txt File")
fileName = askopenfilename()
classInfoList = []
classRoster = {}
with open(fileName, newline = '') as listClasses:
    for line in csv.reader(listClasses):
        classInfoList.append(line)
The .txt file is in the format:
professor
class
students
An example would be:
Professor White
Chem 101
Jesse Pinkman, Brandon Walsh, Skinny Pete
The output I desire would be a dictionary with professors as the keys, and then the class and list of students for the values.
OUTPUT:
{"Professor White": ["Chem 101", [Jesse Pinkman, Brandon Walsh, Skinny Pete]]}
However, when I tried the things in the above post, I kept getting errors.
What can I do here?
Thanks
Since the data making up your dictionary is on consecutive lines, you will have to process three lines at once. You can call next() on the file handle like this:
output = {}
input_file = open('file1')
for line in input_file:
    key = line.strip()
    value = [next(input_file).strip()]
    # Split the student line on commas and strip whitespace from each name.
    value.append([name.strip() for name in next(input_file).split(',')])
    output[key] = value
input_file.close()
This would give you:
{'Professor White': ['Chem 101',
['Jesse Pinkman', 'Brandon Walsh', 'Skinny Pete']]}
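If the file might contain blank lines or several professor/class/students groups, a slightly more defensive sketch is to clean the lines first and then step through them three at a time. The function name `parse_roster` and the second group in the sample are invented for illustration.

```python
def parse_roster(lines):
    """Build {professor: [class_name, [students...]]} from groups of three lines."""
    cleaned = [line.strip() for line in lines if line.strip()]  # drop blank lines
    roster = {}
    # Step through the cleaned lines three at a time; an incomplete
    # trailing group is ignored.
    for i in range(0, len(cleaned) - 2, 3):
        professor, class_name, students = cleaned[i:i + 3]
        roster[professor] = [class_name, [s.strip() for s in students.split(",")]]
    return roster


# The first group is from the question; the second is a made-up example.
sample = [
    "Professor White\n",
    "Chem 101\n",
    "Jesse Pinkman, Brandon Walsh, Skinny Pete\n",
    "\n",
    "Professor Plum\n",
    "Bio 200\n",
    "Ann Lee, Bob Ray\n",
]
roster = parse_roster(sample)
print(roster["Professor White"])
```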

Python 2.7.3: Search/Count txt file for string, return full line with final occurrence of that string

I'm trying to create a WiFi Log Scanner. Currently we go through logs manually using CTRL+F and our keywords. I just want to automate that process. i.e. bang in a .txt file and receive an output.
I've got the bones of the code, can work on making it pretty later, but I'm running into a small issue. I want the scanner to search the file (done), count instances of that string (done) and output the number of occurrences (done) followed by the full line where that string occurred last, including line number (line number is not essential, just makes things easier to do a guesstimate of which is the more recent issue if there are multiple).
Currently I'm getting an output of every line with the string in it. I know why this is happening, I just can't think of a way to specify just output the last line.
Here is my code:
import os
from Tkinter import Tk
from tkFileDialog import askopenfilename
def file_len(filename):
    #Count Number of Lines in File and Output Result
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    print('There are ' + str(i+1) + ' lines in ' + os.path.basename(filename))
def file_scan(filename):
    #All Issues to Scan will go here
    print ("DHCP was found " + str(filename.count('No lease, failing')) + " time(s).")
    for line in filename:
        if 'No lease, failing' in line:
            print line.strip()
    DNS= (filename.count('Host name lookup failure:res_nquery failed') + filename.count('HTTP query failed'))/2
    print ("DNS Failure was found " + str(DNS) + " time(s).")
    for line in filename:
        if 'Host name lookup failure:res_nquery failed' or 'HTTP query failed' in line:
            print line.strip()
    print ("PSK= was found " + str(testr.count('psk=')) + " time(s).")
    for line in ln:
        if 'psk=' in line:
            print 'The length(s) of the PSK used is ' + str(line.count('*'))
Tk().withdraw()
filename=askopenfilename()
abspath = os.path.abspath(filename) #So that doesn't matter if File in Python Dir
dname = os.path.dirname(abspath) #So that doesn't matter if File in Python Dir
os.chdir(dname) #So that doesn't matter if File in Python Dir
print ('Report for ' + os.path.basename(filename))
file_len(filename)
file_scan(filename)
That's, pretty much, going to be my working code (just have to add a few more issue searches), I have a version that searches a string instead of a text file here. This outputs the following:
Total Number of Lines: 38
DHCP was found 2 time(s).
dhcp
dhcp
PSK= was found 2 time(s).
The length(s) of the PSK used is 14
The length(s) of the PSK used is 8
I only have general stuff there, modified for it being a string rather than txt file, but the string I'm scanning from will be what's in the txt files.
Don't worry too much about PSK, I want all examples of that listed, I'll see If I can tidy them up into one line at a later stage.
As a side note, a lot of this is jumbled together from doing previous searches, so I have a good idea that there are probably neater ways of doing this. This is not my current concern, but if you do have a suggestion on this side of things, please provide an explanation/link to explanation as to why your way is better. I'm fairly new to python, so I'm mainly dealing with stuff I currently understand. :)
Thanks in advance for any help, if you need any further info, please let me know.
Joe
To search for a string and count its occurrences, I solved it in the following way:
'''---------------------Function--------------------'''
#Counting the "string" occurrence in a file
def count_string_occurrence():
    string = "foo"
    f = open("result_file.txt")
    contents = f.read()
    f.close()
    print "Number of '" + string + "' in file", contents.count(string)
    #we are searching "foo" string in file "result_file.txt"
I can't comment on questions yet, but I think I could answer more specifically with some more information: which line do you want only one of?
For example, you can do something like:
search_str = 'find me'
count = 0
last_line = None  # stays None if the string never occurs
for line in file:
    if search_str in line:
        last_line = line
        count += 1
print '{0} occurrences of this line:\n{1}'.format(count, last_line)
I notice that in file_scan you are iterating through the file twice. You can surely condense it into one iteration :).
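One way that single iteration could look, as a sketch only: keep a running count, the last line number and the last matching line for each search string. The function name `scan_lines` and the log lines are invented for the example. It counts matching lines rather than total occurrences, which may or may not be what the scanner needs.

```python
def scan_lines(lines, search_strings):
    """Return {search_string: (count, last_line_number, last_line)}."""
    results = {s: (0, None, None) for s in search_strings}
    for num, line in enumerate(lines, 1):
        for s in search_strings:
            if s in line:
                # Bump the count and remember this line as the latest match.
                count = results[s][0] + 1
                results[s] = (count, num, line.strip())
    return results


# Made-up log lines; a real run would pass an open file object instead.
log = [
    "10:01 dhcp: No lease, failing\n",
    "10:02 wifi: connected\n",
    "10:05 HTTP query failed\n",
    "10:09 dhcp: No lease, failing\n",
]
report = scan_lines(log, ["No lease, failing", "HTTP query failed", "psk="])
for s in sorted(report):
    count, num, last = report[s]
    print("'{}' found {} time(s); last at line {}: {}".format(s, count, num, last))
```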