I need to do pretty much what a 'grep -i str file' gives back, but I have been hitting my head against this issue for ages.
I have a function called 'siteLookup' that I am passing two parameters: a string 's' and a file handle 'f'.
I want to a) determine whether there is a (single) occurrence of the string (in this example, site="XX001"),
and b) if it is found, take the line it was found in and return another field value that I extract from that line back to the caller (it is a CSV lookup). I have had this working periodically, but then it stops working and I cannot see why.
I have tried all of the different 'open' options, including f.readlines etc.
#example line: 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
#example from lookupFile.csv: "XX001","Charleston","United States"
sf = open('lookupFile.csv')

def siteLookup(s, f):
    site = s.split(',')[0].strip().upper()
    if len(site) == 5:
        f.seek(0)
        for line in f:
            if line.find(site) >= 0:
                city = line.split(',')[1].strip('"').upper()
                return city
            # else site not found
            return -1
    else:  # len(site) != 5
        return -1

city = siteLookup(line, sf)
print(city)
sf.close()
I am getting zero matches in this code. (I have simplified this example code to a single search.) I am expecting to get back the name of the city that matches a 5-character site code; the site code is the first field in the example "line".
Any assistance much appreciated.
Your return is wrongly indented - if the thing you look for is not found in the first line, it will return -1 and not look further.
Use with open(...) as f: to make your code more robust:
with open("lookupFile.csv","w") as f:
f.write("""#example from lookupFile.csv:
"XX001","Charleston","United States"
""")
def siteLookup(s, f):
    site = s.split(',')[0].strip().upper()
    if len(site) == 5:
        f.seek(0)
        for line in f:
            if site in line:  # 'if site in line' is easier for checking
                city = line.split(',')[1].strip('"').upper()
                return city
            # wrongly indented - will return if site not in line
            # return -1
    # if too short or not found, return -1 - no need for 2 returns
    return -1
line = 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
with open('lookupFile.csv') as sf:
    city = siteLookup(line, sf)
    print(city)
Output:
CHARLESTON
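As a side note, since lookupFile.csv is a CSV file, the csv module can do the field splitting and quote stripping for you, and an exact match on the first field avoids the false positives a substring test like 'site in line' can produce. A minimal sketch under the same file layout (the function name site_lookup_csv is my own):

import csv

def site_lookup_csv(s, path='lookupFile.csv'):
    site = s.split(',')[0].strip().upper()
    if len(site) != 5:
        return -1
    with open(path) as f:
        for row in csv.reader(f):
            # row[0] is the site code, with quotes already stripped by csv
            if row and row[0].strip().upper() == site:
                return row[1].strip().upper()   # the city field
    return -1

line = 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
print(site_lookup_csv(line))   # CHARLESTON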
I am reading in a CSV file and it works quite well, but some of the strings look like this:
u'Egg'
When trying to convert this to a string, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128). I have read various questions similar to this, but trying the solutions provided resulted in the same error.
Strangely, when debugging, as you can see in the picture, the variable CITY has the correct value. But it still crashes.
Below is my function:
import math
import pandas as pd

def readData(filename, delimiter=";"):
    """
    Read in our data from a CSV file and create a dictionary of records,
    where the key is a unique record ID and each value is a dict
    """
    data = pd.read_csv(filename, delimiter=delimiter, encoding="UTF-8")
    data.set_index("TRNUID")
    returnValue = {}
    for index, row in data.iterrows():
        if index == 0:
            print row["CITY"]
        else:
            if math.isnan(row["DUNS"]) == True:
                DUNS = ""
            else:
                DUNS = str((int(row["DUNS"])))[:-2]
            NAME = str(row["NAME"]).encode("utf-8")
            STREET = str(row["STREET"]).encode("utf-8")
            CITY = row["CITY"]
            POSTAL = str(row["POSTAL"]).encode("utf-8")
            returnValue[row["TRNUID"]] = {
                "DUNS": DUNS,
                "NAME": NAME,
                "STREET": STREET,
                "CITY": CITY,
                "POSTAL": POSTAL
            }
    return returnValue
You're trying to convert to an ASCII string something that inherently cannot be converted to it.
If you look at the unicode character for \xfc, it is a "u" with an umlaut. Indeed, your screenshot of the variables shows "Egg a.d.Guntz" with an umlaut over the "u". The problem is not with "Egg", therefore, but with the continuation.
You could address this by removing all diacritics from your characters (as in this question), but you will lose information.
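To make that concrete for the code above: in Python 2, calling str() on a unicode value implicitly encodes it with the ASCII codec, which is exactly what raises this error, so a str() wrapper fails before .encode("utf-8") is ever reached. A minimal Python 2 sketch (the value is an assumption based on the screenshot):

# -*- coding: utf-8 -*-
name = u'Egg a.d. G\xfcntz'   # assumed value containing a u-umlaut

# str(name) implicitly encodes with the ASCII codec and raises
# UnicodeEncodeError before .encode('utf-8') is ever reached:
#   str(name).encode('utf-8')

# Encoding the unicode value directly works:
name_bytes = name.encode('utf-8')
print name_bytes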
I'm trying to get PyPDF2 to extract specific text throughout a document, per the code below. It is pulling exactly what I need and eliminating the duplicates, but it is not giving me a list from each page; it seems to only show the text from the last page. What am I doing wrong?
#import PyPDF2 and set extracted text as the page_content variable
import PyPDF2

pdf_file = open('enme2.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

#for loop to get number of pages and extract text from each page
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content = page.extractText()

#initialize the user_input variable
user_input = ""

#function to get the AFE numbers from the pdf document
def get_afenumbers(Y):
    #initialize the afe and afelist variables
    afe = "A"
    afelist = ""
    x = ""
    #while loop to get only 6 digits after the "A"
    while True:
        if user_input.upper().startswith("Y") == True:
            #Return a list of AFE's
            import re
            afe = re.findall('[A][0-9]{6}', page_content)
            set(afe)
            print(set(afe))
            break
        else:
            afe = "No AFE numbers found..."
        if user_input.upper().startswith("N") == True:
            print("HAVE A GREAT DAY - GOODBYE!!!")
            break

#Build a while loop for initial question prompt (when Y or N is not True):
while user_input != "Y" and user_input != "N":
    user_input = input('List AFE numbers? Y or N: ').upper()
    if user_input not in ["Y", "N"]:
        print('"', user_input, '"', 'is an invalid input')
get_afenumbers(user_input)

#FIGURE OUT HOW TO EXTRACT FROM ALL PAGES AND NOT JUST ONE
I'm quite new to this; I just learned about regex from a response to my question earlier today. Thanks for any help.
If you change it a little, it seems to work fine.

page_content = ""                        # define the variable before the loop
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    page_content += page.extractText()   # concatenate the pages' text
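If you prefer, an equivalent sketch collects each page's text in a list and joins once at the end, which avoids repeated string concatenation on long documents (same PyPDF2 calls as in the question):

pages_text = []
for page_number in range(number_of_pages):
    page = read_pdf.getPage(page_number)
    pages_text.append(page.extractText())
page_content = "".join(pages_text)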
I'm trying to find out how to stop an os.walk after it has walked through a particular file.
I have a directory of log files organized by date. I'm trying to replace grep searches allowing a user to find ip addresses stored in a date range they specify.
The program will take the following arguments:
-i ipv4 or ipv6 address with subnet
-s start date ie 2013/12/20 matches file structure
-e end date
I'm assuming that, because of the topdown option, there is logic that should allow me to declare an endpoint. What is the best way to do this? I'm thinking a while loop.
I apologize in advance if something is off with my question. Just checked blood sugar, it's low 56, gd type one.
Additional information
The file structure will be situated in flows/index_border as follows:
2013
--01
--02
----01
----...
----29
2014
Hope this is clear: a year folder contains month folders, which contain day folders, which contain hourly files. Dates increase downwards.
The end date will need to be inclusive (I didn't focus too much on it, because I can just add code to move one day up).
I have been trying to make a date range function; I was surprised I didn't see one in any datetime docs, as it seems like it would be useful.
import os, gzip, netaddr, datetime, argparse

startDir = '.'

def sdate_format(s):
    try:
        return datetime.datetime.strptime(s, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad start date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

def edate_format(e):
    try:
        return datetime.datetime.strptime(e, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad end date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

parser = argparse.ArgumentParser(description='Locate IP address in log files for a particular date or date range')
parser.add_argument('-s', '--start_date', action='store', type=sdate_format, dest='start_date',
                    help='The first date in range of interest.')
parser.add_argument('-e', '--end_date', action='store', type=edate_format, dest='end_date',
                    help='The last date in range of interest.')
parser.add_argument('-i', action='store', dest='net',
                    help='IP address or address range, IPv4 or IPv6 with optional subnet accepted.', required=True)
results = parser.parse_args()

start = results.start_date
end = results.end_date
target_ip = results.net

startDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(start.year, start.month, start.day)

print('searching...')
for root, dirs, files in os.walk(startDir):
    for contents in files:
        if contents.endswith('.gz'):
            f = gzip.open(os.path.join(root, contents), 'r')
        else:
            f = open(os.path.join(root, contents), 'r')
        text = f.readlines()
        f.close()
        for line in text:
            for address_item in netaddr.IPNetwork(target_ip):
                if str(address_item) in line:
                    print line,
You need to describe what works or does not work. The argparse part of your code looks fine, though I haven't done any testing. The use of type is refreshingly correct. :) (Posters often misuse that parameter.)
But as for the stopping, I'm guessing you could do:
endDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(end.year, end.month, end.day)
for root, dirs, files in os.walk(startDir):
    for contents in files:
        ....
    if endDir in <something based on dirs and files>:
        break
I don't know enough about your file structure to be more specific. It's also been some time since I worked with os.walk. In any case, I think a conditional break is the way to stop the walk early.
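A hedged sketch of such a conditional break, assuming the walk starts at the top of the tree ('/flows/index_border') and that the zero-padded folder names make alphabetical order equal to date order; endDir is the end date's folder as computed above, and the per-file search is elided:

import os

top = '/flows/index_border'           # assumed top of the year/month/day tree
for root, dirs, files in os.walk(top):
    dirs.sort()                       # zero-padded names, so this is date order
    for name in files:
        pass                          # open and search each file here
    if os.path.normpath(root) == os.path.normpath(endDir):
        break                         # stop once the end date's folder is done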
#!/usr/bin/env python
import os, gzip, netaddr, datetime, argparse, sys

searchDir = '.'
searchItems = []

def sdate_format(s):
    try:
        return datetime.datetime.strptime(s, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad start date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

def edate_format(e):
    try:
        return datetime.datetime.strptime(e, '%Y/%m/%d').date()
    except ValueError:
        msg = "Bad end date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

parser = argparse.ArgumentParser(description='Locate IP address in log files for a particular date or date range')
parser.add_argument('-s', '--start_date', action='store', type=sdate_format, dest='start_date',
                    help='The first date in range of interest.', required=True)
parser.add_argument('-e', '--end_date', action='store', type=edate_format, dest='end_date',
                    help='The last date in range of interest.', required=True)
parser.add_argument('-i', action='store', dest='net',
                    help='IP address or address range, IPv4 or IPv6 with optional subnet accepted.', required=True)
results = parser.parse_args()

start = results.start_date
end = results.end_date + datetime.timedelta(days=1)   # make the end date inclusive
target_IP = results.net
dateRange = end - start

for addressOfInterest in netaddr.IPNetwork(target_IP):
    searchItems.append(str(addressOfInterest))

print('searching...')
for eachDay in range(dateRange.days):
    period = start + datetime.timedelta(days=eachDay)
    searchDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(period.year, period.month, period.day)
    for contents in os.listdir(searchDir):
        if contents.endswith('.gz'):
            f = gzip.open(os.path.join(searchDir, contents), 'rb')
        else:
            f = open(os.path.join(searchDir, contents), 'r')
        text = f.readlines()
        f.close()
        for addressOfInterest in searchItems:
            for line in text:
                if addressOfInterest in line:
                    print contents
                    print line,
I was banging my head because I thought I was printing a duplicate. It turns out the file I was given to test has duplication. I ended up removing os.walk due to the predictable nature of the file system, but @hpaulj did provide a correct solution. Much appreciated!
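As an aside, the date range function the question wished for is only a few lines with datetime.timedelta; a minimal sketch (the folder format string is copied from the code above):

import datetime

def date_range(start, end):
    # Yield each date from start to end, inclusive
    for offset in range((end - start).days + 1):
        yield start + datetime.timedelta(days=offset)

for day in date_range(datetime.date(2013, 12, 20), datetime.date(2013, 12, 22)):
    print '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(day.year, day.month, day.day)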
I am trying to split one file with two articles in it into two separate files with one article in each, for subsequent analysis of the articles. Each article in the initial file has an ID that I want to use to separate the files, using RE.
Below is the initial input file, with ID number:
166068619 #### "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]."
106899978 #### "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."
However, when I run my code, I do get two separate files as an output but they are empty.
This is my code:
def file_split(path_to_file):
    """Function splits a bigger file into N smaller ones, based on a certain RE
    match that is used to break the bigger file into smaller ones"""
    import re

    def pattern_extract(path_to_file):
        """Function identifies the number of RE occurrences in a file;
        the number can be used in further analysis as a range"""
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)

    y = pattern_extract(path_to_file)
    m = y + 1
    files = [open('filename%i.txt' % i, 'w') for i in range(1, m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files
Output result is as follows:
file_split(path)
Out[19]:
[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
<open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]
I am new to Python and I am not quite sure where the problem lies. I checked some other answers that addressed the multiple file outputs but cannot figure out the solution. Help would be very much appreciated.
There are two problems with your code:
you write only the line matching the ID (actually, just the match itself), not the rest
you are always writing to the last file, as you use i, the loop variable "left over" from the list comprehension
To fix it, you could change the lower portion of your code to this:
y = pattern_extract(path_to_file)
files = [open('filename%i.txt' % i, 'w') for i in range(y)]
n = -1
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        files[n].write(line)
But you do not have to read the file two times at all, just to count the matches: Just open another file when the line matches an ID line and directly write to that last file in the list, then close all the files.
open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        open_files[-1].write(line)
for f in open_files:
    f.close()
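One caveat with this version (my addition, not part of the original answer): if any text precedes the first ID line, open_files[-1] is indexed while the list is still empty and raises an IndexError. A small guard avoids that:

open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        if open_files:                  # skip anything before the first ID line
            open_files[-1].write(line)
for f in open_files:
    f.close()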
I'm trying to create a WiFi log scanner. Currently we go through logs manually using CTRL+F and our keywords. I just want to automate that process, i.e. feed in a .txt file and receive an output.
I've got the bones of the code and can work on making it pretty later, but I'm running into a small issue. I want the scanner to search the file (done), count instances of a string (done), and output the number of occurrences (done), followed by the full line where that string last occurred, including the line number (the line number is not essential; it just makes it easier to guesstimate which is the more recent issue if there are multiple).
Currently I'm getting an output of every line with the string in it. I know why this is happening; I just can't think of a way to specify outputting only the last line.
Here is my code:
import os
from Tkinter import Tk
from tkFileDialog import askopenfilename

def file_len(filename):
    #Count number of lines in file and output the result
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    print('There are ' + str(i+1) + ' lines in ' + os.path.basename(filename))

def file_scan(filename):
    #All issues to scan will go here
    print ("DHCP was found " + str(filename.count('No lease, failing')) + " time(s).")
    for line in filename:
        if 'No lease, failing' in line:
            print line.strip()
    DNS = (filename.count('Host name lookup failure:res_nquery failed') + filename.count('HTTP query failed')) / 2
    print ("DNS Failure was found " + str(DNS) + " time(s).")
    for line in filename:
        if 'Host name lookup failure:res_nquery failed' or 'HTTP query failed' in line:
            print line.strip()
    print ("PSK= was found " + str(testr.count('psk=')) + " time(s).")
    for line in ln:
        if 'psk=' in line:
            print 'The length(s) of the PSK used is ' + str(line.count('*'))

Tk().withdraw()
filename = askopenfilename()
abspath = os.path.abspath(filename)  #so it doesn't matter if the file is in the Python dir
dname = os.path.dirname(abspath)     #so it doesn't matter if the file is in the Python dir
os.chdir(dname)                      #so it doesn't matter if the file is in the Python dir
print ('Report for ' + os.path.basename(filename))
file_len(filename)
file_scan(filename)
That's pretty much going to be my working code (I just have to add a few more issue searches). I have a version that searches a string instead of a text file here. It outputs the following:
Total Number of Lines: 38
DHCP was found 2 time(s).
dhcp
dhcp
PSK= was found 2 time(s).
The length(s) of the PSK used is 14
The length(s) of the PSK used is 8
I only have general stuff there, modified for it being a string rather than a txt file, but the string I'm scanning will be what's in the txt files.
Don't worry too much about PSK; I want all examples of that listed. I'll see if I can tidy them up into one line at a later stage.
As a side note, a lot of this is jumbled together from previous searches, so I have a good idea that there are probably neater ways of doing this. That is not my current concern, but if you do have a suggestion on this side of things, please provide an explanation or a link to one as to why your way is better. I'm fairly new to Python, so I'm mainly dealing with stuff I currently understand. :)
Thanks in advance for any help, if you need any further info, please let me know.
Joe
To search for and count string occurrences, I solved it in the following way:
'''---------------------Function--------------------'''
#Counting the occurrences of a string in a file
def count_string_occurrence():
    string = "foo"   #we are searching for the string "foo" in "result_file.txt"
    f = open("result_file.txt")
    contents = f.read()
    f.close()
    print "Number of '" + string + "' in file", contents.count(string)
I can't comment on questions yet, but I think I could answer more specifically with some more information: which line do you want only one of?
For example, you can do something like:
search_str = 'find me'
count = 0
last_line = ''   # initialized so the print below works even with no match
for line in file:
    if search_str in line:
        last_line = line
        count += 1
print '{0} occurrences of this line:\n{1}'.format(count, last_line)
I notice that in file_scan you are iterating twice through file. You can surely condense it into one iteration :).
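To illustrate that, a hedged single-pass sketch: read the file once and, for each issue string, keep a count plus the last matching line and its number (the issue strings are taken from the question; the dict layout is my own):

issues = ['No lease, failing', 'HTTP query failed']
counts = dict.fromkeys(issues, 0)
last_seen = {}

with open(filename) as f:
    for number, line in enumerate(f, 1):
        for issue in issues:
            if issue in line:
                counts[issue] += 1
                last_seen[issue] = (number, line.strip())

for issue in issues:
    print issue, 'was found', counts[issue], 'time(s).'
    if issue in last_seen:
        number, text = last_seen[issue]
        print 'Last occurrence (line {0}): {1}'.format(number, text)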