I'm working with a very large text file (58GB) that I'm attempting to split into smaller chunks. The problem I'm running into is that the smaller chunks appear to be hex. I'm having my terminal print each line to stdout as well, and there it looks like normal strings to me. Is this known behavior? I've never encountered an issue where Python keeps spitting stuff out in hex before. Even odder, when I tried using Ubuntu's split from the command line, it also generated everything in hex.
Code snippet below:
from os import path
from datetime import datetime

working_dir = '/SECRET/'
output_dir = path.join(working_dir, 'output')
test_file = 'SECRET.txt'
report_file = 'SECRET_REPORT.txt'
output_chunks = 100000000
output_base = 'SECRET'

input = open(test_file, 'r')
report_output = open(report_file, 'w')
count = 0
at_line = 0
output_f = None

for line in input:
    if count % output_chunks == 0:
        if output_f:
            report_output.write('[{}] wrote {} lines to {}. Total count is {}'.format(
                datetime.now(), output_chunks, str(output_base + str(at_line) + '.txt'), count))
            output_f.close()
        output_f = open('{}{}.txt'.format(output_base, str(at_line)), 'wb')
        at_line += 1
    output_f.write(line.encode('ascii', 'ignore'))
    print line.encode('ascii', 'ignore')
    count += 1
Here's what was going on:
Each line started with a NUL character. When I opened parts of the file using head or PyCharm's terminal it showed up normal, but when I looked at my output in Sublime Text it was picking up on that NUL character and rendering the results in hex. I had to strip '\x00' from each line of the output, and then it started looking the way I would expect it to.
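For anyone hitting the same thing, here is a minimal sketch of that cleanup step; the filenames are placeholders (the first few lines just fabricate a sample chunk containing NUL bytes):

```python
# Fabricate a sample chunk file with a leading NUL on each line,
# standing in for one of the split output files.
with open('chunk.txt', 'w') as f:
    f.write('\x00first line\n\x00second line\n')

# Strip the stray NUL characters from each line before re-writing.
with open('chunk.txt') as src, open('chunk_clean.txt', 'w') as dst:
    for line in src:
        dst.write(line.replace('\x00', ''))
```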
I am reading a text file in Python (500 rows) that looks like this:
File Input:
0082335401
0094446049
01008544409
01037792084
01040763890
I wanted to ask whether it is possible to insert one space after the 5th character in each line:
Desired Output:
00823 35401
00944 46049
01008 544409
01037 792084
01040 763890
I have tried the code below:
st = " ".join(st[i:i + 5] for i in range(0, len(st), 5))
but the following output was returned on executing it:
00823 35401
0094 44604 9
010 08544 409
0 10377 92084
0104 07638 90
I am a novice in Python. Any help would make a difference.
There seem to be two issues here. Judging by your provided code, you are reading the file into one single string. It would be much preferable (in your case) to read the file in as a list of strings, like the following (assuming your input file is input_data.txt):
# Initialize a list for the data to be stored
data = []

# Iterate through your file to read the data
with open("input_data.txt") as f:
    for line in f.readlines():
        # Use .rstrip() to get rid of the newline character at the end
        data.append(line.rstrip("\r\n"))
Then, to operate on the data you obtained in a list, you could use a list comprehension similar to the one you have tried to use.
# Assumes that data is the result from the above code
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
Hope this helped!
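Putting the two steps together on the sample data from the question (input_data.txt is the assumed filename; the first few lines just fabricate it):

```python
# Fabricate the sample input from the question.
with open("input_data.txt", "w") as f:
    f.write("0082335401\n0094446049\n01008544409\n")

# Step 1: read the file as a list of strings, stripping newlines.
data = []
with open("input_data.txt") as f:
    for line in f:
        data.append(line.rstrip("\r\n"))

# Step 2: insert a space after the 5th character of each line.
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
print(data)
```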
If your only requirement is to insert a space after the fifth character, then you could use the following simple version:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        if len(line) > 5:
            print(line[0:5] + " " + line[5:])
        else:
            print(line)
If you don't mind if lines with less than five characters get a space at the end, you could even omit the if-else-statement and go with the print-function from the if-clause:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        print(line[0:5] + " " + line[5:])
I'm writing a small Python script which takes a log file, matches strings within it, and saves them along with a custom string "goal " to another text file. Then I take some values from the second file and add them to a list. The problem is that, depending on the length of the custom string (e.g. "goalgoalgoal "), the list with the values varies in length - currently, I'm working with a log file which includes 1031 matches of the string "goal ", but the length of the list varies between roughly 980 and 1029.
Here is the code:
for line in inputfile:
    if "Started---" in line:
        startTime = line[11:23]
        testfile.write("\n"+"Start"+"\n"+"goal "+ startTime+"\n")
        counterLines += 1
    elif "done!" in line:
        testfile.write("\n"+find_between(line, "| ", "done!")+"\n")
    elif "Errors:" in line:
        testfile.write("\n"+"Errors:"+line.split("Errors:",1)[1]+"\n")
    elif "Warnings:" in line:
        testfile.write("\n"+"Warnings:"+line.split("Warnings:",1)[1]+"\n")
    elif "Successes:" in line:
        testfile.write("\n"+"Successes:"+line.split("Successes:",1)[1]+"\n")
    elif "END---" in line:
        endTime = line[11:23]
        testfile.write("\n"+"End"+"\n"+"endTime "+ endTime+"\n")
    else:
        print("nothing found")
testfileread = open(filePath+"\\testFile.txt", "r")
startTimesList = []
endTimesList = []

for line in testfileread:
    matchObj = re.match(r'goal', line)
    if matchObj:
        startTimesList.append(line)

print(len(startTimesList))
Do you have ideas why the code doesn't work as expected?
Thank you in advance!
Most probably it's due to the fact that you don't flush testFile.txt after writing is completed - as a result, there is unpredictable amount of data in the file when you start reading it. Calling testfile.flush() should fix the problem. Alternatively, wrap the writing logic in a with block.
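A minimal sketch of that write-then-read pattern (the filename and sample lines are placeholders): leaving the with block closes the file, which guarantees all buffered writes are flushed to disk before the second pass reads them back.

```python
import re

# Stand-ins for the lines the log-parsing loop would write.
lines_to_write = ['goal 12:00:00.000\n', 'goal 12:00:05.000\n']

# Write inside a with block so the file is flushed and closed on exit.
with open('testFile.txt', 'w') as testfile:
    for entry in lines_to_write:
        testfile.write(entry)

# Only now is it safe to re-open the file and count the 'goal' lines.
with open('testFile.txt') as testfileread:
    start_times = [line for line in testfileread if re.match(r'goal', line)]

print(len(start_times))
```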
I have been searching for an answer to this, but can't seem to get what I need. I would like a Python script that reads my text file, works its way through each line from the top, and then prints all the matches to another txt file. The content of the text file is just 4-digit numbers like 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file, since it has almost 10000 lines. I appreciate any help. If someone could just point me in the right direction that would be great. Thank you.
import re

with open("filename") as file:
    fileContent = file.read()

pattern = "1234"
print(len(re.findall(pattern, fileContent)))
If I were you I would open the file and use the split method to create a list with all the numbers in it, then use the Counter class from collections to count how many of each number in the list are duplicates.
from collections import Counter

filepath = 'original_file'
new_filepath = 'new_file'

file = open(filepath, 'r')
text = file.read()
file.close()

numbers_list = text.split('\n')
dupes = [[item, ': matches found =', str(count)]
         for item, count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]

new_file = open(new_filepath, 'w')
for i in dupes:
    new_file.write(i + '\n')
new_file.close()
Thanks to everyone who helped me on this. Thank you to @csabinho for the code he provided and to @IanAuld for asking me "Why do you think you need recursion here?" It got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit combos were unique. So this is what I came up with, and it worked beautifully!
import re

# Read the file once instead of re-opening it on every iteration
with open("4digits.txt") as file:
    fileContent = file.read()

for a in range(1000, 10000):
    pattern = str(a)
    result = len(re.findall(pattern, fileContent))
    if result >= 1:
        print(a, "matches", result)
    else:
        print(a, "This number is unique!")
I have this file, testpi.txt, which I'd like to convert into a list of sentences.
>>>cat testpi.txt
This is math π.
That is moth pie.
Here's what I've done:
r = open('testpi.txt', 'r')
sentence_List = r.readlines()
print sentence_List
And, when the output is sent to another text file - output.txt - this is how it looks in output.txt:
['This is math \xcf\x80. That is moth pie.\n']
I tried codecs too, r = codecs.open('testpi.txt', 'r', encoding='utf-8'), but the output then consists of a leading 'u' in all the entries.
How can I display this byte string - \xcf\x80 - as π in output.txt?
Please guide me, thanks.
The problem is you're printing the entire list which gives you an output format you don't want. Instead, print each string individually and it will work:
r = open('t.txt', 'r')
sentence_List = r.readlines()
for line in sentence_List:
    print line,
Or:
print "['{}']".format("', '".join(map(str.rstrip, sentence_List)))
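As a side note, here is a sketch of the same idea in Python 3, where opening the file with an explicit UTF-8 encoding decodes the bytes up front, so π prints as a real character rather than as \xcf\x80 (the first lines just fabricate the sample file from the question):

```python
# Fabricate the sample testpi.txt from the question.
with open('testpi.txt', 'w', encoding='utf-8') as f:
    f.write('This is math \u03c0.\nThat is moth pie.\n')

# Python 3: decode the file as UTF-8 while reading it.
with open('testpi.txt', encoding='utf-8') as f:
    sentence_list = [line.rstrip('\n') for line in f]

# Printing each string individually avoids the list's repr escapes.
for sentence in sentence_list:
    print(sentence)
```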
I am a python newbie. I can print the twitter search results, but when I save to .txt, I only get one result. How do I add all the results to my .txt file?
t = Twython(app_key=api_key, app_secret=api_secret, oauth_token=acces_token, oauth_token_secret=ak_secret)

tweets = []
MAX_ATTEMPTS = 10
COUNT_OF_TWEETS_TO_BE_FETCHED = 500

for i in range(0, MAX_ATTEMPTS):
    if COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets):
        break
    if 0 == i:
        results = t.search(q="#twitter", count='100')
    else:
        results = t.search(q="#twitter", include_entities='true', max_id=next_max_id)
    for result in results['statuses']:
        tweet_text = result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']
        tweets.append(tweet_text)
        print tweet_text
        text_file = open("Output.txt", "w")
        text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']))
        text_file.close()
You just need to rearrange your code to open the file BEFORE you do the loop:
t = Twython(app_key=api_key, app_secret=api_secret, oauth_token=acces_token, oauth_token_secret=ak_secret)

tweets = []
MAX_ATTEMPTS = 10
COUNT_OF_TWEETS_TO_BE_FETCHED = 500

with open("Output.txt", "w") as text_file:
    for i in range(0, MAX_ATTEMPTS):
        if COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets):
            break
        if 0 == i:
            results = t.search(q="#twitter", count='100')
        else:
            results = t.search(q="#twitter", include_entities='true', max_id=next_max_id)
        for result in results['statuses']:
            tweet_text = result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']
            tweets.append(tweet_text)
            print tweet_text
            text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'], result['user']['followers_count'], result['text'], result['created_at'], result['source']))
            text_file.write('\n')
I use Python's with statement here to open a context manager. The context manager will handle closing the file when you drop out of the loop. I also added another write command that writes out a carriage return so that each line of data would be on its own line.
You could also open the file in append mode ('a' instead of 'w'), which would allow you to remove the 2nd write command.
There are two general solutions to your issue. Which is best may depend on more details of your program.
The simplest solution is just to open the file once at the top of your program (before the loop) and then keep reusing the same file object over and over in the later code. Only when the whole loop is done should the file be closed.
with open("Output.txt", "w") as text_file:
    for i in range(0, MAX_ATTEMPTS):
        # ...
        for result in results['statuses']:
            # ...
            text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'],
                                                 result['user']['followers_count'],
                                                 result['text'],
                                                 result['created_at'],
                                                 result['source']))
Another solution would be to open the file several times, but to use the "a" append mode when you do so. Append mode does not truncate the file like "w" write mode does, and it seeks to the end automatically, so you don't overwrite the file's existing contents. This approach would be most appropriate if you were writing to several different files. If you're just writing to the one, I'd stick with the first solution.
for i in range(0, MAX_ATTEMPTS):
    # ...
    for result in results['statuses']:
        # ...
        with open("Output.txt", "a") as text_file:
            text_file.write("#%s,%s,%s,%s,%s" % (result['user']['screen_name'],
                                                 result['user']['followers_count'],
                                                 result['text'],
                                                 result['created_at'],
                                                 result['source']))
One last point: It looks like you're writing out comma separated data. You may want to use the csv module, rather than writing your file manually. It can take care of things like quoting or escaping any commas that appear in the data for you.
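For example, here is a minimal sketch of writing those same fields with csv.writer; the result dict below is a fabricated stand-in for one entry of results['statuses'], and the embedded comma in the text shows why the csv module helps:

```python
import csv

# Fabricated stand-in for one entry of results['statuses'].
result = {
    'user': {'screen_name': 'example_user', 'followers_count': 42},
    'text': 'hello, world',  # note the embedded comma
    'created_at': 'Mon Jan 01 00:00:00 +0000 2024',
    'source': 'web',
}

with open('Output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # csv.writer quotes the text field automatically because it
    # contains a comma, so the row still parses back into 5 fields.
    writer.writerow(['#' + result['user']['screen_name'],
                     result['user']['followers_count'],
                     result['text'],
                     result['created_at'],
                     result['source']])
```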