Python v3 inconsistent regex match returns - regex

I'm writing a small Python script that takes a log file, matches strings in it, and saves them, together with a custom string "goal ", to another text file. Then I take some values from the second file and add them to a list. The problem is that, depending on the length of the custom string (e.g. "goalgoalgoal "), the list of values varies in length - currently I'm working with a log file that includes 1031 matches of the string "goal ", but the length of the list varies anywhere between ~980 and 1029.
Here is the code:
for line in inputfile:
    if "Started---" in line:
        startTime = line[11:23]
        testfile.write("\n"+"Start"+"\n"+"goal "+ startTime+"\n")
        counterLines +=1
    elif "done!" in line:
        testfile.write("\n"+find_between(line, "| ", "done!")+"\n")
    elif "Errors:" in line:
        testfile.write("\n"+"Errors:"+line.split("Errors:",1)[1]+"\n")
    elif "Warnings:" in line:
        testfile.write("\n"+"Warnings:"+line.split("Warnings:",1)[1]+"\n")
    elif "Successes:" in line:
        testfile.write("\n"+"Successes:"+line.split("Successes:",1)[1]+"\n")
    elif "END---" in line:
        endTime = line[11:23]
        testfile.write("\n"+"End"+"\n"+"endTime "+ endTime+"\n")
    else:
        print("nothing found")
testfileread = open(filePath+"\\testFile.txt", "r")
startTimesList = []
endTimesList = []
for line in testfileread:
    matchObj = re.match(r'goal', line)
    if matchObj:
        startTimesList.append(line)
print(len(startTimesList))
Do you have ideas why the code doesn't work as expected?
Thank you in advance!

Most probably it's due to the fact that you don't flush testFile.txt after writing is completed - as a result, there is an unpredictable amount of data in the file when you start reading it. Calling testfile.flush() should fix the problem. Alternatively, wrap the writing logic in a with block.
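For illustration, a minimal sketch of the with-block variant, reusing names from the question (the matching logic itself is elided):

import re

# The with block guarantees the buffer is flushed and the file is
# closed before the read pass starts.
with open(filePath + "\\testFile.txt", "w") as testfile:
    for line in inputfile:
        ...  # the if/elif matching logic from the question goes here

# The file is now complete on disk, so every "goal " line is visible.
with open(filePath + "\\testFile.txt", "r") as testfileread:
    startTimesList = [line for line in testfileread if re.match(r'goal', line)]
print(len(startTimesList))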

Related

matching an entire list with each and every line of file

I had written a piece of code that basically performs find and replace from a list on a text file.
It maps the entire list into a dictionary. Then each line of the text file is processed and matched against every key in the dictionary; if a match is found anywhere in the line, it is replaced with the corresponding value from the list (dictionary).
Here is the code:
import sys
import re
#open file using open file mode
fp1 = open(sys.argv[1]) # Open file on read mode
lines = fp1.read().split("\n") # Create a list containing all lines
fp1.close() # Close file
fp2 = open(sys.argv[2]) # Open file on read mode
words = fp2.read().split("\n") # Create a list containing all lines
fp2.close() # Close file
word_hash = {}
for word in words:
    #print(word)
    if(word != ""):
        tsl = word.split("\t")
        word_hash[tsl[0]] = tsl[1]
#print(word_hash)
keys = word_hash.keys()
#skeys = sorted(keys, key=lambda x:x.split(" "),reverse=True)
#print(keys)
#print (skeys)
for line in lines:
    if(line != ""):
        for key in keys:
            #my_regex = key + r"\b"
            my_regex = r"([\"\( ])" + key + r"([ ,\.!\"।)])"
            #print(my_regex)
            if((re.search(my_regex, line, re.IGNORECASE|re.UNICODE))):
                line = re.sub(my_regex, r"\1" + word_hash[key]+r"\2",line,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
                #print("iam :1",line)
            if((re.search(key + r"$", line, re.IGNORECASE|re.UNICODE))):
                line = re.sub(key+r"$", word_hash[key],line,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
                #print("iam :2",line)
            if((re.search(r"^" + key, line, re.IGNORECASE|re.UNICODE))):
                #print(line)
                line = re.sub(r"^" + key, word_hash[key],line,flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
                #print("iam :",line)
        print(line)
    else:
        print(line)
The problem is that as the list grows, execution slows down, because every line of the text file is matched against each and every key in the list. Where can I improve the performance of this code?
List file:
word1===>replaceword1
word2===>replaceword2
.....
The list is tab separated. I used ===> here for easy understanding.
Input file:
hello word1 I am here.
word2. how are you word1?
Expected Output:
hello replaceword1 I am here.
replaceword2. how are you replaceword1?
If your word list is small enough, the best speedup you can achieve for the match-and-replace process is to use a single big regexp and a functional re.sub.
This way you have a single call to the optimised function.
EDIT: In order to preserve the order of replacements (this can lead to chained replacing, I don't know if that is the intended behavior), we can perform the replacements in batches rather than in a single run, where batch order respects file order and each batch is made of disjoint possible string matches.
The code would be as follows:
import sys
import re

word_hashes = []

def insert_word(word, replacement, hashes):
    if not hashes:
        return [{word: replacement}]
    for prev_word in hashes[0]:
        if word in prev_word or prev_word in word:
            return [hashes[0]] + insert_word(word, replacement, hashes[1:])
    hashes[0][word] = replacement
    return hashes

with open(sys.argv[2]) as fp2: # Open file on read mode
    words = fp2.readlines()
for word in [w.strip() for w in words if w.strip()]:
    tsl = word.split("\t")
    word_hashes = insert_word(tsl[0],tsl[1], word_hashes)

#open file using open file mode
with open(sys.argv[1]) as fp1:
    content = fp1.read()
for word_hash in word_hashes:
    my_regex = r"([\"\( ])(" + '|'.join(word_hash.keys()) + r")([ ,\.!\"।)])"
    content = re.sub(my_regex, lambda x: x.group(1) + word_hash[x.group(2)] + x.group(3), content, flags=re.IGNORECASE|re.UNICODE|re.MULTILINE)
print(content)
We obtain chained replacement for the example data. For example, with the following words to replace
roses are red==>flowers are blue
are==>is
Text to parse
roses are red and beautiful
flowers are yellow
Output
roses is red and beautiful
flowers is yellow
Why don't you read the content of the entire file into a string and just use string.replace? For example:
def find_replace():
    txt = ''
    #Read text from the file as a string
    with open('file.txt', 'r') as fp:
        txt = fp.read()
    dct = {"word1":"replaceword1","word2":"replaceword2"}
    #Find and replace characters
    for k,v in dct.items():
        txt = txt.replace(k,v)
    #Write back the modified string
    with open('file.txt', 'w') as fp:
        fp.write(txt)
If the input file is:
hello word1 I am here.
word2. how are you word1?
The output will be:
hello replaceword1 I am here.
replaceword2. how are you replaceword1?

Python Outputting Text in Hex

I'm working with a very large text file (58GB) that I'm attempting to split into smaller chunks. The problem I'm running into is that the smaller chunks appear to be hex. I'm having my terminal print each line to stdout as well, and when I see it printed in stdout it looks like normal strings to me. Is this known behavior? I've never encountered an issue where Python keeps spitting stuff out in hex before. Even odder, when I tried using Ubuntu's split from the command line, it also generated everything in hex.
Code snippet below:
from os import path
from datetime import datetime

working_dir = '/SECRET/'
output_dir = path.join(working_dir, 'output')
test_file = 'SECRET.txt'
report_file = 'SECRET_REPORT.txt'
output_chunks = 100000000
output_base = 'SECRET'

input = open(test_file, 'r')
report_output = open(report_file, 'w')
count = 0
at_line = 0
output_f = None

for line in input:
    if count % output_chunks == 0:
        if output_f:
            report_output.write('[{}] wrote {} lines to {}. Total count is {}'.format(
                datetime.now(), output_chunks, str(output_base + str(at_line) + '.txt'), count))
            output_f.close()
        output_f = open('{}{}.txt'.format(output_base, str(at_line)), 'wb')
        at_line += 1
    output_f.write(line.encode('ascii', 'ignore'))
    print line.encode('ascii', 'ignore')
    count += 1
Here's what was going on:
Each line started with a NUL character. When I opened parts of the file using head or PyCharm's terminal it showed up normal, but when I looked at my output in Sublime Text, it picked up on that NUL character and rendered the results in hex. I had to strip '\x00' from each line of the output, and then it started looking the way I would expect.
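A minimal sketch of that cleanup step, assuming the chunk files should be rewritten with the NUL bytes removed (the file names here are placeholders):

# Strip the NUL characters so editors like Sublime Text stop treating
# the chunk as binary data and rendering it in hex.
with open('SECRET0.txt') as src, open('SECRET0_clean.txt', 'w') as dst:
    for line in src:
        dst.write(line.replace('\x00', ''))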

In python insert one space after every 5th Character in each line of a text file

I am reading a text file in Python (500 rows) and it looks like this:
File Input:
0082335401
0094446049
01008544409
01037792084
01040763890
I wanted to ask: is it possible to insert one space after the 5th character in each line?
Desired Output:
00823 35401
00944 46049
01008 544409
01037 792084
01040 763890
I have tried the code below:
st = " ".join(st[i:i + 5] for i in range(0, len(st), 5))
but executing it returned the following output:
00823 35401
0094 44604 9
010 08544 409
0 10377 92084
0104 07638 90
I am a novice in Python. Any help would make a difference.
There seem to be two issues here - judging by your provided code, you are reading the file into one single string. It would be much preferable (in your case) to read the file in as a list of strings, like the following (assuming your input file is input_data.txt):
# Initialize a list for the data to be stored
data = []
# Iterate through your file to read the data
with open("input_data.txt") as f:
for line in f.readlines():
# Use .rstrip() to get rid of the newline character at the end
data.append(line.rstrip("\r\n"))
Then, to operate on the data you obtained in a list, you could use a list comprehension similar to the one you have tried to use.
# Assumes that data is the result from the above code
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
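If the modified lines then need to go back to disk rather than just stay in memory, a minimal follow-up sketch (the output file name is just a placeholder):

# Write the spaced lines out, one per line.
with open("output_data.txt", "w") as f:
    f.write("\n".join(data) + "\n")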
Hope this helped!
If your only requirement is to insert a space after the fifth character, then you could use the following simple version:
#!/usr/bin/env python
with open("input_data") as data:
for line in data.readlines():
line = line.rstrip()
if len(line) > 5:
print(line[0:5]+" "+line[5:])
else:
print(line)
If you don't mind lines with five or fewer characters getting a space at the end, you could even omit the if-else statement and go with the print call from the if clause:
#!/usr/bin/env python
with open("input_data") as data:
for line in data.readlines():
line = line.rstrip()
print(line[0:5]+" "+line[5:])

Save multiple lines of text in .txt

I am a python newbie. I can print the twitter search results, but when I save to .txt, I only get one result. How do I add all the results to my .txt file?
t = Twython(app_key=api_key, app_secret=api_secret, oauth_token=acces_token, oauth_token_secret=ak_secret)

tweets = []
MAX_ATTEMPTS = 10
COUNT_OF_TWEETS_TO_BE_FETCHED = 500

for i in range(0,MAX_ATTEMPTS):
    if(COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets)):
        break
    if(0 == i):
        results = t.search(q="#twitter",count='100')
    else:
        results = t.search(q="#twitter",include_entities='true',max_id=next_max_id)
    for result in results['statuses']:
        tweet_text = result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']
        tweets.append(tweet_text)
        print tweet_text
        text_file = open("Output.txt", "w")
        text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']))
        text_file.close()
You just need to rearrange your code to open the file BEFORE you do the loop:
t = Twython(app_key=api_key, app_secret=api_secret, oauth_token=acces_token, oauth_token_secret=ak_secret)

tweets = []
MAX_ATTEMPTS = 10
COUNT_OF_TWEETS_TO_BE_FETCHED = 500

with open("Output.txt", "w") as text_file:
    for i in range(0,MAX_ATTEMPTS):
        if(COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets)):
            break
        if(0 == i):
            results = t.search(q="#twitter",count='100')
        else:
            results = t.search(q="#twitter",include_entities='true',max_id=next_max_id)
        for result in results['statuses']:
            tweet_text = result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']
            tweets.append(tweet_text)
            print tweet_text
            text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']))
            text_file.write('\n')
I use Python's with statement here to open a context manager. The context manager will handle closing the file when you drop out of the loop. I also added another write command that writes out a newline so that each line of data ends up on its own line.
You could also open the file in append mode ('a' instead of 'w'), which would allow you to remove the 2nd write command.
There are two general solutions to your issue. Which is best may depend on more details of your program.
The simplest solution is just to open the file once at the top of your program (before the loop) and then keep reusing the same file object over and over in the later code. Only when the whole loop is done should the file be closed.
with open("Output.txt", "w") as text_file:
for i in range(0,MAX_ATTEMPTS):
# ...
for result in results['statuses']:
# ...
text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'],
result['user']['followers_count'],
result['text'],
result['created_at'],
result['source']))
Another solution would be to open the file several times, but to use the "a" append mode when you do so. Append mode does not truncate the file like "w" write mode does, and it seeks to the end automatically, so you don't overwrite the file's existing contents. This approach would be most appropriate if you were writing to several different files. If you're just writing to the one, I'd stick with the first solution.
for i in range(0,MAX_ATTEMPTS):
    # ...
    for result in results['statuses']:
        # ...
        with open("Output.txt", "a") as text_file:
            text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'],
                                                 result['user']['followers_count'],
                                                 result['text'],
                                                 result['created_at'],
                                                 result['source']))
One last point: It looks like you're writing out comma separated data. You may want to use the csv module, rather than writing your file manually. It can take care of things like quoting or escaping any commas that appear in the data for you.
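A hedged sketch of that csv variant, shown for a single batch of results and keeping the same fields (note that on Python 2 the csv module expects the file opened in binary mode):

import csv

with open("Output.txt", "wb") as text_file:
    writer = csv.writer(text_file)
    for result in results['statuses']:
        # writerow appends the newline and quotes any field that
        # contains a comma.
        writer.writerow(['#' + result['user']['screen_name'],
                         result['user']['followers_count'],
                         result['text'],
                         result['created_at'],
                         result['source']])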

Python 2.7.3: Search/Count txt file for string, return full line with final occurrence of that string

I'm trying to create a WiFi Log Scanner. Currently we go through logs manually using CTRL+F and our keywords. I just want to automate that process, i.e. feed in a .txt file and receive an output.
I've got the bones of the code (I can work on making it pretty later), but I'm running into a small issue. I want the scanner to search the file (done), count instances of that string (done) and output the number of occurrences (done), followed by the full line where that string occurred last, including the line number (the line number is not essential, it just makes it easier to guesstimate which is the more recent issue if there are multiple).
Currently I'm getting an output of every line with the string in it. I know why this is happening, I just can't think of a way to specify just output the last line.
Here is my code:
import os
from Tkinter import Tk
from tkFileDialog import askopenfilename

def file_len(filename):
    #Count Number of Lines in File and Output Result
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    print('There are ' + str(i+1) + ' lines in ' + os.path.basename(filename))

def file_scan(filename):
    #All Issues to Scan will go here
    print ("DHCP was found " + str(filename.count('No lease, failing')) + " time(s).")
    for line in filename:
        if 'No lease, failing' in line:
            print line.strip()
    DNS= (filename.count('Host name lookup failure:res_nquery failed') + filename.count('HTTP query failed'))/2
    print ("DNS Failure was found " + str(DNS) + " time(s).")
    for line in filename:
        if 'Host name lookup failure:res_nquery failed' or 'HTTP query failed' in line:
            print line.strip()
    print ("PSK= was found " + str(testr.count('psk=')) + " time(s).")
    for line in ln:
        if 'psk=' in line:
            print 'The length(s) of the PSK used is ' + str(line.count('*'))

Tk().withdraw()
filename=askopenfilename()
abspath = os.path.abspath(filename) #So that doesn't matter if File in Python Dir
dname = os.path.dirname(abspath) #So that doesn't matter if File in Python Dir
os.chdir(dname) #So that doesn't matter if File in Python Dir
print ('Report for ' + os.path.basename(filename))
file_len(filename)
file_scan(filename)
That's pretty much going to be my working code (I just have to add a few more issue searches). I have a version that searches a string instead of a text file here. This outputs the following:
Total Number of Lines: 38
DHCP was found 2 time(s).
dhcp
dhcp
PSK= was found 2 time(s).
The length(s) of the PSK used is 14
The length(s) of the PSK used is 8
I only have general stuff there, modified for it being a string rather than txt file, but the string I'm scanning from will be what's in the txt files.
Don't worry too much about PSK, I want all examples of that listed, I'll see If I can tidy them up into one line at a later stage.
As a side note, a lot of this is jumbled together from doing previous searches, so I have a good idea that there are probably neater ways of doing this. This is not my current concern, but if you do have a suggestion on this side of things, please provide an explanation/link to explanation as to why your way is better. I'm fairly new to python, so I'm mainly dealing with stuff I currently understand. :)
Thanks in advance for any help, if you need any further info, please let me know.
Joe
To search for and count string occurrences, I solved it in the following way:
'''---------------------Function--------------------'''
#Counting the "string" occurrence in a file
def count_string_occurrence():
    string = "test"
    f = open("result_file.txt")
    contents = f.read()
    f.close()
    print "Number of '" + string + "' in file", contents.count(string)
#we are searching for the "test" string in the file "result_file.txt"
I can't comment on questions yet, but I think I can answer more specifically with some more information: which line do you want only one of?
For example, you can do something like:
search_str = 'find me'
count = 0
for line in file:
    if search_str in line:
        last_line = line
        count += 1
print '{0} occurrences of this line:\n{1}'.format(count, last_line)
I notice that in file_scan you are iterating twice through file. You can surely condense it into one iteration :).
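As a hedged sketch of that single-pass idea, combining the counting and the last-line tracking for several search strings at once (search terms borrowed from the question, filename as in the script above):

search_strings = ['No lease, failing', 'psk=']
counts = dict.fromkeys(search_strings, 0)
last_seen = {}

# One pass over the file: tally every search string and remember the
# line number and text of its most recent occurrence.
with open(filename) as f:
    for number, line in enumerate(f, 1):
        for s in search_strings:
            if s in line:
                counts[s] += 1
                last_seen[s] = (number, line.strip())

for s in search_strings:
    if counts[s]:
        number, text = last_seen[s]
        print "'%s' was found %d time(s); last on line %d: %s" % (s, counts[s], number, text)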