readlines() function and unicodes - python-2.7

I have this file, testpi.txt, which I'd like to convert into a list of sentences.
>>>cat testpi.txt
This is math π.
That is moth pie.
Here's what I've done:
r = open('testpi.txt', 'r')
sentence_List = r.readlines()
print sentence_List
When the output is redirected to another text file, output.txt, this is how it looks:
['This is math \xcf\x80. That is moth pie.\n']
I tried codecs too, with r = codecs.open('testpi.txt', 'r', encoding='utf-8'),
but then every entry in the output has a leading 'u'.
How can I display this byte string, \xcf\x80, as π in output.txt?
Please guide me, thanks.

The problem is that you're printing the entire list. Printing a list shows the repr() of each element, which escapes non-ASCII bytes such as \xcf\x80. Instead, print each string individually and it will work:
r = open('t.txt', 'r')
sentence_List = r.readlines()
for line in sentence_List:
    print line,  # trailing comma: the line already ends with '\n'
Or:
print "['{}']".format("', '".join(map(str.rstrip, sentence_List)))
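If the goal is to get π into output.txt directly, one option is to decode on read and encode on write. The following is only a sketch, assuming the file is UTF-8 encoded; it uses io.open, which is available in both Python 2.7 and Python 3, and a temporary directory (an assumption made here so the example does not touch real files):

```python
import io
import os
import tempfile

# Hypothetical temporary directory so the sketch does not touch real files
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'testpi.txt')
out_path = os.path.join(workdir, 'output.txt')

# Recreate the sample input from the question as UTF-8 bytes
with io.open(in_path, 'w', encoding='utf-8') as f:
    f.write(u'This is math \u03c0.\nThat is moth pie.\n')

# Decode on read, then write each line (not the list) and encode on write
with io.open(in_path, 'r', encoding='utf-8') as f:
    sentence_list = f.readlines()
with io.open(out_path, 'w', encoding='utf-8') as out:
    for line in sentence_list:
        out.write(line)

# Read the result back to check that the character survived the round trip
with io.open(out_path, 'r', encoding='utf-8') as f:
    written = f.read()
print(repr(written))
```

Writing the lines one by one avoids the list repr, so no 'u' prefix or escape sequence ends up in the file.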

Related

Python Outputting Text in Hex

I'm working with a very large text file (58GB) that I'm attempting to split into smaller chunks. The problem that I'm running into is that the smaller chunks appear to be Hex. I'm having my terminal print each line to stdout as well, but when I'm seeing it printed in stdout it's looking like normal strings to me. Is this known behavior? I've never encountered an issue where Python keeps spitting stuff out in Hex before. Even odder when I tried using Ubuntu's split from the command line it was also generating everything in Hex.
Code snippet below:
from datetime import datetime
from os import path
working_dir = '/SECRET/'
output_dir = path.join(working_dir, 'output')
test_file = 'SECRET.txt'
report_file = 'SECRET_REPORT.txt'
output_chunks = 100000000
output_base = 'SECRET'
input = open(test_file, 'r')
report_output = open(report_file, 'w')
count = 0
at_line = 0
output_f = None
for line in input:
    # Start a new chunk file every output_chunks lines
    if count % output_chunks == 0:
        if output_f:
            report_output.write('[{}] wrote {} lines to {}. Total count is {}'.format(
                datetime.now(), output_chunks, str(output_base + str(at_line) + '.txt'), count))
            output_f.close()
        output_f = open('{}{}.txt'.format(output_base, str(at_line)), 'wb')
        at_line += 1
    output_f.write(line.encode('ascii', 'ignore'))
    print line.encode('ascii', 'ignore')
    count += 1
Here's what was going on:
Each line started with a NUL character. When I opened parts of the file using head or PyCharm's terminal, it showed up normally, but Sublime Text picked up on the NUL character and rendered the output in hex. Once I stripped '\x00' from each line of the output, it looked the way I expected.
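For anyone hitting the same thing, stripping the NUL characters could look roughly like this. It is a minimal sketch in Python 3; strip_nul and the sample lines are made up for illustration and are not part of the original script:

```python
def strip_nul(line):
    """Return the line with every NUL character removed."""
    return line.replace('\x00', '')

# Hypothetical sample lines, each starting with a NUL character
sample_lines = ['\x00first line\n', '\x00second line\n']

cleaned = [strip_nul(line) for line in sample_lines]
print(cleaned)
```

In the chunking loop, the same call would be applied to each line just before it is written to the output file.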

python file reading and splitting the words

I am reading a file in Python and splitting its contents on '\n'. When I print the resulting list, it shows 'Magni\xef\xac\x81cent Mary' instead of 'Magnificent Mary'.
Here is my code...
with open('/home/naveen/Desktop/answer.txt') as ans:
    content = ans.read()
content = content.split('\n')
print content
Note: answer.txt contains the following lines:
Magnificent Mary
Flying Sikh
Payyoli Express
Here is my output of the program
The problem is in your text file: there are some Unicode characters in "Magnificent Mary". If you fix that, your program should work. If you want to read text containing Unicode characters, you have to decode it properly as UTF-8.
Have a look at this one (assuming you want to use Python 2): Backporting Python 3 open(encoding="utf-8") to Python 2
Python 2:
import codecs
with codecs.open(filename='/Users/emily/Desktop/answers.txt', mode='rb', encoding='UTF-8') as ans:
    content = ans.read().splitlines()
for i in content:
    print i
If you can use Python 3, you can actually do this:
with open('/home/naveen/Desktop/answer.txt', encoding='UTF-8') as ans:
    content = ans.read().splitlines()
print(content)
The problem is the 'fi' in Magnificent Mary. It is not a normal 'f' followed by 'i', but the single character LATIN SMALL LIGATURE FI (U+FB01). You can simply delete it and retype 'fi' in gedit.
To verify the difference, simply include
print [(ord(a),a) for a in (file.split("\n"))[0]]
at the end of your code for both versions.
If there is no way to edit the file, you could first convert the string to unicode and then use Python's unicodedata module.
import unicodedata
file = open("answer.txt")
file = (file.read()).decode('utf-8')
print unicodedata.normalize('NFKD', file).encode('ascii','ignore').split("\n")
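To see what the normalization does to the ligature on its own, here is a small sketch in Python 3 (the snippet above is Python 2). The sample string is typed in by hand rather than read from answer.txt:

```python
import unicodedata

# Sample string containing U+FB01 (LATIN SMALL LIGATURE FI)
text = 'Magni\ufb01cent Mary'

# NFKD replaces compatibility characters such as the ligature with their
# plain equivalents ('f' followed by 'i')
normalized = unicodedata.normalize('NFKD', text)

print(len(text), len(normalized))
print(normalized)
```

The normalized string is one character longer because the single ligature character becomes two ordinary letters.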

Python: How can you recursively search a .txt file, find matches and print results

I have been searching for an answer to this, but can not seem to get what I need. I would like a python script that reads my text file and starting from the top working its way through each line of the file and then prints out all the matches in another txt file. Content of the text file is just 4 digit numbers like 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file due to its almost 10000 lines. I appreciate any help. If someone could just point me in the right direction, that would be great. Thank you.
import re
file = open("filename", 'r')
fileContent=file.read()
pattern="1234"
print len(re.findall(pattern,fileContent))
If I were you, I would open the file, use the split method to create a list of all the numbers, and use Counter from collections to count how many times each number appears in the list.
from collections import Counter
filepath = 'original_file'
new_filepath = 'new_file'
file = open(filepath,'r')
text = file.read()
file.close()
numbers_list = text.split('\n')
# Keep only the numbers that occur more than once
dupes = [[item,': matches found =',str(count)] for item,count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]
new_file = open(new_filepath,'w')
for i in dupes:
    new_file.write(i + '\n')  # one result per line
new_file.close()
Thanks to everyone who helped me on this. Thank you to #csabinho for the code he provided and to #IanAuld for asking me "Why do you think you need recursion here?" It got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit combos were unique. So this is what I came up with and it worked beautifully!
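import re
# Read the file once instead of reopening it for every candidate number
with open("4digits.txt", 'r') as file:
    fileContent = file.read()
a = 999
while a < 9999:
    a = a + 1
    pattern = str(a)
    result = len(re.findall(pattern, fileContent))
    if result > 1:
        # The number appears more than once in the file
        print(a, "matches", result)
    elif result == 1:
        # The number appears exactly once in the file
        print(a, "This number is unique!")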
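The same information can also be collected in a single pass with Counter, without trying every number from 1000 to 9999. This is just a sketch in Python 3; sample_text is a made-up stand-in for the contents of the file:

```python
from collections import Counter

# Hypothetical sample standing in for the contents of the text file
sample_text = "1234\n3214\n4567\n8963\n1532\n1234\n"

# split() with no argument drops blank lines and surrounding whitespace
counts = Counter(sample_text.split())

# Numbers seen more than once, and numbers seen exactly once
duplicates = {number: n for number, n in counts.items() if n > 1}
unique = sorted(number for number, n in counts.items() if n == 1)

for number, n in sorted(duplicates.items()):
    print("{} : matches found = {}".format(number, n))
print("unique:", unique)
```

Because each line is counted as a whole, this also avoids regex matches that would straddle two neighbouring numbers if the file had no separators.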

Converting a text file into a dictionary in python

So, I'm a bit new to Python and I've come across the following problem in one of my codes:
I have a txt file with the following text:
Jolly 77777
Fargo 88888
Hunt 68548
I want to convert it into a dictionary with BOTH the name and number as keys... Here's what I have so far but I keep getting a traceback error and am not sure as to what error I am making. It's driving me nuts; Help?
This is what I have so far:
filename = open("ident.txt","r")
dictionary={}
with open('ident.txt','r') as f:
    for line in f.readlines():
        a,b = line.split()
        dictionary[a] = int(b)
You're close:
dictionary = {}
with open('ident.txt','r') as f:
    for line in f:
        a,b = line.split()
        dictionary[a] = int(b)
That yields a dictionary value of:
{'Fargo': 88888, 'Hunt': 68548, 'Jolly': 77777}
FWIW, the line filename = open("ident.txt","r") isn't going to do you any favors, since filename will end up being an open file, not a filename. And you don't need f.readlines(). Files iterate fine on their own, line by line.
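The question asks for BOTH the name and the number as keys, which the loop above does not do by itself. One possible reading is a dictionary that can be looked up in either direction. Here is a sketch in Python 3 under that assumption, with io.StringIO standing in for ident.txt so that it runs on its own:

```python
import io

# Hypothetical stand-in for ident.txt so the sketch runs on its own
sample = io.StringIO("Jolly 77777\nFargo 88888\nHunt 68548\n")

dictionary = {}
for line in sample:
    name, number = line.split()
    dictionary[name] = int(number)   # look up the number by name
    dictionary[int(number)] = name   # look up the name by number

print(dictionary['Fargo'])
print(dictionary[68548])
```

This only works cleanly if no name or number is repeated, since a repeated key would overwrite the earlier entry.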

I need to write a Python stub to print names of image files and whether they are blurry or not

New user here, and just started Python a few days ago!
My question is:
I need to write a Python stub to print names of image files and whether they are blurry or not. They are considered blurry if the value is > 0.3. There are 5 bits of information in each line, the second bit (index 1) is the number in question. In total there are 1868 lines.
Here is a sample of the data:
['out04-32-44-03.tif,0.295554,536047.6051,5281850.4252,19.8091\n',
'out04-32-44-15.tif,0.337232,536047.2831,5281850.5974,19.8256\n',
'out04-32-44-27.tif,0.2984,536046.9611,5281850.7696,19.8420\n',
'out04-32-44-39.tif,0.311989,536046.6392,5281850.9418,19.8584\n',
'out04-32-44-51.tif,0.346901,536046.3172,5281851.1140,19.8749\n',
'out04-32-44-63.tif,0.358519,536045.9953,5281851.2862,19.8913\n',
'out04-32-44-75.tif,0.342837,536045.6733,5281851.4584,19.9078\n',
'out04-32-44-87.tif,0.32909,536045.3513,5281851.6306,19.9242\n',
'out04-32-44-99.tif,0.294824,536045.0294,5281851.8028,19.9406\n']
Any suggestions greatly appreciated :-)
This is based on the code you posted in the comments, and it is for Python 2.7:
# Raw string so the backslashes in the Windows path are never treated as escapes
fin = open(r'E:\KGG 375 - GIS Advanced\Assignment 2 - Python\TIR043109gpxpos.txt')
for line in fin: # no need to read these into a list first
    info = line.split(',')
    blurry = float(info[1])
    print info[0],
    if blurry > 0.3:
        print ' is blurry'
    else:
        print ' is not blurry'
Explanation:
There is no need to read the lines of a file into a list; you can just iterate over the file and it will be read line by line.
To be able to compare against a float, you need to convert the second element (info[1]) into a float.
print info[0], will print the filename, and the trailing comma will prevent a line break, so " is blurry" will be printed on the same line. Note: this is Python 2.7 syntax, so it will not work with Python 3.x.
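For Python 3.x, the same check could be written roughly as follows. This is a sketch: the rows are copied from the sample in the question, while describe and BLUR_THRESHOLD are names made up for this example:

```python
# Sample rows copied from the question's data
sample_lines = [
    'out04-32-44-03.tif,0.295554,536047.6051,5281850.4252,19.8091\n',
    'out04-32-44-15.tif,0.337232,536047.2831,5281850.5974,19.8256\n',
    'out04-32-44-27.tif,0.2984,536046.9611,5281850.7696,19.8420\n',
]

BLUR_THRESHOLD = 0.3  # from the question: blurry if the value is > 0.3

def describe(line):
    """Return '<filename> is blurry' or '<filename> is not blurry'."""
    info = line.strip().split(',')
    status = 'is blurry' if float(info[1]) > BLUR_THRESHOLD else 'is not blurry'
    return '{} {}'.format(info[0], status)

results = [describe(line) for line in sample_lines]
for result in results:
    print(result)
```

With a real file, sample_lines would be replaced by the open file object, since iterating over a file yields one line at a time.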