python file reading and splitting the words - python-2.7

I am reading a file in Python and splitting its contents on '\n'. When I print the resulting list, it shows 'Magni\xef\xac\x81cent Mary' instead of 'Magnificent Mary'.
Here is my code...
with open('/home/naveen/Desktop/answer.txt') as ans:
    content = ans.read()
    content = content.split('\n')
    print content
Note: answer.txt contains the following lines:
Magnificent Mary
Flying Sikh
Payyoli Express
Here is my output of the program

The problem is in your text file: "Magnificent Mary" contains a non-ASCII Unicode character. If you replace it with plain ASCII, your program will work as is. If you want to read text that contains such characters, you have to decode the file's bytes from UTF-8 into a unicode string.
Have a look at the question "Backporting Python 3 open(encoding="utf-8") to Python 2" (assuming you want to stay on Python 2).
In Python 2:
import codecs
with codecs.open(filename='/Users/emily/Desktop/answers.txt', mode='rb', encoding='UTF-8') as ans:
    content = ans.read().splitlines()
for i in content:
    print i
If you can use Python 3, you can do this directly:
with open('/home/naveen/Desktop/answer.txt', encoding='UTF-8') as ans:
    content = ans.read().splitlines()
print(content)
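For code that has to run unchanged on both Python 2 and Python 3, io.open is one option, since it accepts the same encoding argument as the Python 3 built-in open(). Below is a minimal sketch. It assumes a throwaway temporary file (sample_path and sample_bytes are made-up names) standing in for answer.txt:

```python
import io
import os
import tempfile

# Hypothetical stand-in for answer.txt. The first line contains the
# U+FB01 ligature, stored as the UTF-8 bytes EF AC 81.
sample_bytes = b"Magni\xef\xac\x81cent Mary\nFlying Sikh\nPayyoli Express\n"

fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(sample_bytes)

# io.open decodes the bytes while reading, so each line is text, not raw bytes.
with io.open(sample_path, encoding="utf-8") as ans:
    content = ans.read().splitlines()

os.remove(sample_path)

# Show the line count and the code point of the sixth character of line one.
print(len(content), hex(ord(content[0][5])))
```

After decoding, the three bytes have become one character, so the line can be compared, searched, or normalized as ordinary text.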

The problem is the 'fi' in "Magnificent Mary". It is not a normal 'f' followed by 'i' but a single character, U+FB01 LATIN SMALL LIGATURE FI, which UTF-8 encodes as the three bytes \xef\xac\x81. You can simply delete it and retype 'fi' in gedit.
To verify the difference, add
print [(ord(a), a) for a in content[0]]
at the end of your code and run it on both versions of the file.
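If there is no way to edit the file, you can first decode the string to unicode and then normalize it with Python's unicodedata module. NFKD normalization decomposes the ligature into a plain 'f' and 'i', and encode('ascii', 'ignore') drops any remaining non-ASCII characters.
import unicodedata
with open("answer.txt") as f:
    text = f.read().decode('utf-8')
print unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').split("\n")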
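The normalization step can be checked without the file. The following is a small sketch in Python 3 syntax (the answer above is Python 2). The names raw, ligature and lines are illustrative, and the bytes are copied from the output shown in the question:

```python
import unicodedata

# Bytes as they would be read from answer.txt (assumed from the question's output).
raw = b"Magni\xef\xac\x81cent Mary\nFlying Sikh\nPayyoli Express\n"
text = raw.decode("utf-8")

# The sixth character of the first line is a single code point, not 'f' + 'i'.
ligature = text[5]
ligature_name = unicodedata.name(ligature)

# NFKD applies compatibility decomposition, which splits the ligature
# into the two ASCII letters 'f' and 'i'.
normalized = unicodedata.normalize("NFKD", text)
lines = normalized.splitlines()

print(ligature_name)
print(lines)
```

Note that NFKD also rewrites other compatibility characters (for example superscript digits), so it suits cleanup for matching better than lossless round-tripping.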

Related

In python insert one space after every 5th Character in each line of a text file

I am reading a text file in Python (500 rows) that looks like this:
File Input:
0082335401
0094446049
01008544409
01037792084
01040763890
Is it possible to insert one space after the 5th character in each line?
Desired Output:
00823 35401
00944 46049
01008 544409
01037 792084
01040 763890
I have tried the code below:
st = " ".join(st[i:i + 5] for i in range(0, len(st), 5))
but executing it returned the output below:
00823 35401
0094 44604 9
010 08544 409
0 10377 92084
0104 07638 90
I am a novice in Python. Any help would make a difference.
There seem to be two issues here. Judging by the code you provided, you are reading the whole file into one single string, so the newline characters count towards each group of five and the grouping drifts from line to line. The expression also inserts a space after every five characters rather than only after the first five. In your case it is much preferable to read the file as a list of strings, like the following (assuming your input file is input_data.txt):
# Initialize a list for the data to be stored
data = []
# Iterate through your file to read the data
with open("input_data.txt") as f:
    for line in f.readlines():
        # Use .rstrip() to get rid of the newline character at the end
        data.append(line.rstrip("\r\n"))
Then, to operate on the data you obtained in a list, you could use a list comprehension similar to the one you have tried to use.
# Assumes that data is the result from the above code
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
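To see why the original expression drifts, here is a short sketch that runs both approaches on the first three sample lines. The variable names (raw, chunked, fixed) are illustrative only, and the data is held in a string rather than read from a file:

```python
# First three lines of the sample input, as one string with newlines.
raw = "0082335401\n0094446049\n01008544409\n"

# The original approach: chunk the whole string, newlines included.
chunked = " ".join(raw[i:i + 5] for i in range(0, len(raw), 5))

# The per-line approach: split into lines first, then insert one space.
fixed = [
    line[:5] + " " + line[5:] if len(line) > 5 else line
    for line in raw.splitlines()
]

print(chunked.splitlines())
print(fixed)
```

The first print reproduces the misaligned output from the question, because each newline takes up one of the five slots in a chunk. The second gives the desired grouping.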
Hope this helped!
If your only requirement is to insert a space after the fifth character, then you could use the following simple version:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        if len(line) > 5:
            print(line[0:5]+" "+line[5:])
        else:
            print(line)
If you don't mind that lines with five or fewer characters get a space at the end, you could even omit the if-else statement and keep only the print call from the if-clause:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        print(line[0:5]+" "+line[5:])
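The trailing-space behaviour of the shorter version is easy to check on a few made-up lines. The sketch below wraps the slice in a hypothetical helper, insert_space, which is not part of either answer:

```python
def insert_space(line, position=5):
    """Hypothetical helper: insert one space after `position` characters."""
    line = line.rstrip("\r\n")
    return line[:position] + " " + line[position:]


# One long line, one short line, and one line of exactly five characters.
samples = ["01037792084\n", "123\n", "12345\n"]

unconditional = [insert_space(s) for s in samples]

# Stripping afterwards removes the trailing space without an if/else.
trimmed = [s.rstrip() for s in unconditional]

print(unconditional)
print(trimmed)
```

A final rstrip() is one possible way to keep the code free of branches and still avoid trailing spaces. Whether that matters depends on what consumes the output.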

readlines() function and unicodes

I have this file, testpi.txt, which I'd like to convert into a list of sentences.
>>>cat testpi.txt
This is math π.
That is moth pie.
Here's what I've done:
r = open('testpi.txt', 'r')
sentence_List = r.readlines()
print sentence_List
When the output is sent to another text file, output.txt, this is what it looks like in output.txt:
['This is math \xcf\x80. That is moth pie.\n']
I tried codecs too, r = codecs.open('testpi.txt', 'r',encoding='utf-8'),
but the output then consists of a leading 'u' in all the entries.
How can I display the byte string \xcf\x80 as π in output.txt?
Please guide me, thanks.
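The problem is that you are printing the entire list, which gives you an output format you don't want. Printing a list shows the repr() of each element, and in Python 2 the repr() of a byte string escapes non-ASCII bytes (hence \xcf\x80). Instead, print each string individually and it will work:
r = open('testpi.txt', 'r')
sentence_List = r.readlines()
for line in sentence_List:
    print line,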
Or:
print "['{}']".format("', '".join(map(str.rstrip, sentence_List)))
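The difference between the list's representation and the text itself can be sketched in Python 3 syntax, where byte strings play the role of Python 2's str. The sample bytes are taken from the question. The temporary file out_path is a made-up stand-in for output.txt:

```python
import io
import os
import tempfile

# The two lines of testpi.txt as raw UTF-8 bytes (assumed from the question).
raw_lines = [b"This is math \xcf\x80.\n", b"That is moth pie.\n"]

# repr() of the list escapes the non-ASCII bytes, as seen in output.txt.
list_repr = repr(raw_lines)

# Decoding gives real text; writing it back with an explicit encoding
# stores the character itself rather than an escape sequence.
decoded = [line.decode("utf-8") for line in raw_lines]

fd, out_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(out_path, "w", encoding="utf-8") as out:
    out.writelines(decoded)
with io.open(out_path, "rb") as check:
    written = check.read()
os.remove(out_path)

print(list_repr)
```

In this sketch the file ends up containing the two bytes for π, so any UTF-8 aware viewer should display the character.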

Using Python CSV and glob to find matching strings and print row

I have hundreds of CSV files and I'm trying to write a Python script that will parse through all of them and print out rows that have matching string(s). I'll be happy if we can get this to work using one string (and not a list of strings). Using Python 2.7.5. I've figured out so far:
The csv module in Python will print the row with the matching string in a particular column (index 8, i.e. the ninth column from the left):
import csv
reader = csv.reader(open('2015-08-25.csv'))
for row in reader:
    col8 = str(row[8])
    if col8 == '36862210':
        print row
So the above works for one .csv file. Now I need to parse hundreds of .csv files with glob. The glob module will print out all the file names with this code:
import glob
for name in glob.glob('20??-??-??.csv'):
    print name
I tried putting the two together into one script:
import csv
import glob
csvfiles = glob.glob('20??-??-??.csv')
for filename in csvfiles:
    reader = csv.reader(open(csvfiles))
    for row in reader:
        col8 = str(row[8])
        if col8 == '36862210':
            print row
but the error message reads:
  File "test7.py", line 6, in <module>
    reader = csv.reader(open(csvfiles))
TypeError: coercing to Unicode: need string or buffer, list found
You are trying to open a list: csvfiles is the list you are iterating over.
Use the loop variable instead, because open() expects a single filename:
reader = csv.reader(open(filename))
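Put together, the corrected loop might look like the sketch below. It is written for Python 3 (the question uses 2.7.5), wraps the search in a hypothetical find_rows function, and creates two small CSV files in a temporary directory so that it can run on its own. The file contents are invented for illustration:

```python
import csv
import glob
import os
import shutil
import tempfile


def find_rows(pattern, column, value):
    """Yield (filename, row) for every row whose `column` equals `value`."""
    for filename in sorted(glob.glob(pattern)):
        # Open each file by its own name, never the whole list.
        with open(filename, newline="") as handle:
            for row in csv.reader(handle):
                # Skip rows that are too short to have the column.
                if len(row) > column and row[column] == value:
                    yield filename, row


# Build two sample files; only the first has the value we look for.
workdir = tempfile.mkdtemp()
for name, key in [("2015-08-25.csv", "36862210"), ("2015-08-26.csv", "11111111")]:
    with open(os.path.join(workdir, name), "w", newline="") as handle:
        csv.writer(handle).writerow(["a", "b", "c", "d", "e", "f", "g", "h", key])

pattern = os.path.join(workdir, "20??-??-??.csv")
matches = list(find_rows(pattern, 8, "36862210"))
shutil.rmtree(workdir)

for filename, row in matches:
    print(os.path.basename(filename), row)
```

Using a with block closes each file before the next one is opened, which matters when there are hundreds of them.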

I need to write a Python stub to print names of image files and whether they are blurry or not

New user here, and just started Python a few days ago!
My question is:
I need to write a Python stub to print names of image files and whether they are blurry or not. They are considered blurry if the value is > 0.3. There are 5 bits of information in each line, the second bit (index 1) is the number in question. In total there are 1868 lines.
Here is a sample of the data:
['out04-32-44-03.tif,0.295554,536047.6051,5281850.4252,19.8091\n',
'out04-32-44-15.tif,0.337232,536047.2831,5281850.5974,19.8256\n',
'out04-32-44-27.tif,0.2984,536046.9611,5281850.7696,19.8420\n',
'out04-32-44-39.tif,0.311989,536046.6392,5281850.9418,19.8584\n',
'out04-32-44-51.tif,0.346901,536046.3172,5281851.1140,19.8749\n',
'out04-32-44-63.tif,0.358519,536045.9953,5281851.2862,19.8913\n',
'out04-32-44-75.tif,0.342837,536045.6733,5281851.4584,19.9078\n',
'out04-32-44-87.tif,0.32909,536045.3513,5281851.6306,19.9242\n',
'out04-32-44-99.tif,0.294824,536045.0294,5281851.8028,19.9406\n']
Any suggestions greatly appreciated :-)
Based on the code you posted in the comments, this should work on Python 2.7:
# Raw string so that the backslashes in the Windows path are taken literally
fin = open(r'E:\KGG 375 - GIS Advanced\Assignment 2 - Python\TIR043109gpxpos.txt')
for line in fin: # no need to read these into a list first
    info = line.split(',')
    blurry = float(info[1])
    print info[0],
    if blurry > 0.3:
        print ' is blurry'
    else:
        print ' is not blurry'
Explanation:
There is no need to read the lines of a file into a list first; you can iterate over the file object directly and it will yield one line at a time.
To be able to compare against a float, you need to convert the second element (info[1]) to a float.
print info[0], prints the filename, and the trailing comma suppresses the line break so that " is blurry" is printed on the same line. Note: this is Python 2.7 syntax, so it will not work with Python 3.x.
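For Python 3, the same logic could be sketched as follows. The function name classify and the threshold parameter are illustrative choices, and three lines from the sample data in the question are used in place of the file:

```python
def classify(line, threshold=0.3):
    """Return (filename, is_blurry) for one comma-separated line."""
    fields = line.strip().split(",")
    # The blur value is the second field (index 1).
    return fields[0], float(fields[1]) > threshold


# Three lines copied from the sample data in the question.
sample = [
    "out04-32-44-03.tif,0.295554,536047.6051,5281850.4252,19.8091\n",
    "out04-32-44-15.tif,0.337232,536047.2831,5281850.5974,19.8256\n",
    "out04-32-44-27.tif,0.2984,536046.9611,5281850.7696,19.8420\n",
]

results = [classify(line) for line in sample]

for name, blurry in results:
    print(name, "is blurry" if blurry else "is not blurry")
```

Returning a tuple instead of printing inside the function keeps the decision separate from the output, so the results can also be counted or filtered.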

Why does my Python code add additional characters when writing and reading a file

I am just learning to code in Python and have not been able to find out why a file that I have just written to contains additional characters when I read it back.
Code
#-*-coding:utf-8-*-
from sys import argv
from os.path import exists
script, source, copy = argv
print "We'll be opening, reading, writing to and closing a file"
opensource = open(source)
readsource = opensource.read()
print readsource
print "Great. We opened and read file"
opencopy = open(copy, 'w+') #we want to write and read file
opencopy.write(readsource) #copy the contents of the source file
opencopy.read()
opensource.close()
opencopy.close()
Output
Contents
test °D ΃ ø U ø U ` 6 ` 6 0M Ð
I am running version 2.7 of Python on Windows 7 Professional 64bit.
This seems to be a Windows issue with reading a file opened with "w+" directly after a write.
Start with adding two print statements like this:
opencopy.write(readsource) #copy the contents of the source file
print opencopy.tell()
opencopy.read()
print opencopy.tell()
If you run this on a file whose only contents are the word 'test' followed by CR + LF, you get this output:
We'll be opening, reading, writing to and closing a file
test
Great. We opened and read file
6
4098
(If you do the same under Linux, the read does not go beyond the end of the file, and opencopy.tell() returns 6 both times.)
What you probably want to do is:
print opencopy.tell()
opencopy.seek(0)
print opencopy.tell()
opencopy.read()
print opencopy.tell()
You then get 6, 0 and 6 as output from the three tell() calls, and the read now returns the word 'test' that you just wrote.
If you do not want to read what you just wrote, put opencopy.flush() between the write and the read statements:
opencopy.write(readsource) #copy the contents of the source file
print opencopy.tell()
opencopy.flush()
opencopy.read()
print opencopy.tell()
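The seek(0) variant above can be tried in isolation. The sketch below is written for Python 3 and opens the file in binary mode ("w+b"), which is my choice so that tell() reports plain byte offsets. A temporary file (path is a made-up name) replaces the copy target:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

positions = []
with open(path, "w+b") as handle:
    handle.write(b"test\r\n")
    positions.append(handle.tell())  # position right after the write
    handle.seek(0)                   # rewind before switching to reading
    positions.append(handle.tell())
    data = handle.read()
    positions.append(handle.tell())  # back at the end after reading

os.remove(path)

print(positions, data)
```

Moving the file position with seek() (or calling flush()) between a write and a read is what the C standard library requires for update-mode files, which is the rule Python 2's file objects inherit.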