Python pdf to txt - python-2.7

I would like to convert a PDF file to a text file. Here is my code:
testFile = urllib.URLopener()
testFile.retrieve("http://url_to_download", "/Users/gabor_dev/Desktop/pdf_tst/tst.pdf")
content = ""
pdf = pyPdf.PdfFileReader(file("/Users/gabor_dev/Desktop/pdf_tst/tst.pdf", "rb"))
for i in range(0, pdf.getNumPages()):
    f = open("/Users/gabor_dev/Desktop/pdf_tst/xxx.txt", 'a')
    content = pdf.getPage(i).extractText() + "\n"
    c = content.split()
    for a in c:
        f.write(" ")
        f.write(a)
        f.write('\n')
        f.close()
The PDF is downloaded, but when I try to convert it, only the first word of the PDF shows up in the text file, and then I get this error:
Traceback (most recent call last):
  File "/Users/gabor_dev/PycharmProjects/text_class_tst/textClass.py", line 26, in <module>
    f.write(" ")
ValueError: I/O operation on closed file
What am I doing wrong?
Thank you!

Your f.close() runs inside the loop, so the file is already closed when the next write happens. It is better to open the file once using with open:
import urllib
import pyPdf
testFile = urllib.URLopener()
testFile.retrieve("http://www.pdf995.com/samples/pdf.pdf", "./tst.pdf")
content = ""
pdf = pyPdf.PdfFileReader(file("./tst.pdf", "rb"))
# Open the output once; the with block closes it after all pages are written.
with open("./xxx.txt", 'a') as f:
    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText() + "\n"
        c = content.split()
        for a in c:
            f.write(" ")
            f.write(a)
            f.write('\n')
Tested and works.
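The same fix can be sketched independently of the PDF library. This is only an illustration: write_words is a hypothetical helper (not part of pyPdf), it assumes the page texts have already been extracted (for example with extractText() as above), and it is written in Python 3 syntax.

```python
def write_words(page_texts, out_path):
    """Write every whitespace-separated word of every page on its own line.

    Hypothetical helper: page_texts is assumed to be an iterable of strings,
    one per PDF page, already extracted by whatever PDF library is in use.
    Returns the number of words written.
    """
    count = 0
    # Open the output file once; the with block closes it after the last page.
    with open(out_path, "a") as f:
        for text in page_texts:
            for word in text.split():
                f.write(" " + word + "\n")
                count += 1
    return count
```

Because the file is opened outside both loops, no write can ever hit a closed file.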

Related

Python read variable with French characters from file

I have a set of text files in which variables are stored, and I am trying to read them into Python. As long as the variables do not contain any French characters (e.g. é, ç), the following piece of code works well:
#!/usr/bin/python
import imp

def getVarFromFile(filename):
    f = open(filename, 'rt')
    global data
    data = imp.load_source('data', " ", f)
    f.close()
    return()

def main():
    getVarFromFile('test.txt')
    print data.Title
    print data.Language
    print data.Summary
    return()

if __name__ == "__main__":
    main()
Example output:
me@mypc:$ ./readVar.py
Monsieur Flaubert
French
A few lines of text.
However, when the text file contains French characters, for instance:
Title = "Monsieur Flaubert"
Language = "Français"
Summary = "Quelques lignes de texte en Français. é à etc."
I am getting the following error for which I cannot find a solution:
Traceback (most recent call last):
  File "./tag.py", line 30, in <module>
    main()
  File "./tag.py", line 22, in main
    getVarFromFile('test.txt')
  File "./tag.py", line 15, in getVarFromFile
    data = imp.load_source('data', " ", f)
  File " ", line 2
SyntaxError: Non-ASCII character '\xc3' in file on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
How can French (utf-8) characters be handled?
Thanks for your consideration and help to this Python-learner.
You could use codecs.open:
import codecs
data = {}
with codecs.open('test.txt', encoding='utf-8') as f:
    for line in f:
        # Use some logic here to load each line into a dict, like:
        key, value = line.rstrip("\r\n").split(" = ", 1)
        data[key] = value  # the value still includes its surrounding quotes
This solution doesn't use imp, so you have to implement your own logic to interpret the contents of the file.
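One possible version of that "own logic" is sketched below, in Python 3 syntax. load_vars is a hypothetical helper name, and the sketch assumes one assignment per line with every value written as a Python literal (a quoted string in the example file). It relies on ast.literal_eval to remove the quotes without executing the file as code.

```python
import ast
import io


def load_vars(path):
    """Read lines of the form  Name = "value"  from a UTF-8 text file.

    Hypothetical helper; assumes one assignment per line and that every
    value is a Python literal such as a quoted string.
    """
    data = {}
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Split on the first "=" only, so values may contain "=".
            key, _, value = line.partition("=")
            # literal_eval parses the quoted literal without running code.
            data[key.strip()] = ast.literal_eval(value.strip())
    return data
```

With the example file above, load_vars('test.txt')['Language'] would be the string Français, and the values are reached by key rather than as attributes of a module.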

Passing stdin for cpp program using a Python script

I'm trying to write a Python script which:
1) Compiles a cpp file.
2) Reads a text file "Input.txt" whose contents have to be fed to the compiled program.
3) Compares the output with the "Output.txt" file and prints "Pass" if all test cases have passed successfully, else prints "Fail".
import subprocess
from subprocess import PIPE
from subprocess import Popen
subprocess.call(["g++", "eg.cpp"])
inputFile = open("input.txt",'r')
s = inputFile.readlines()
for i in s :
    proc = Popen("./a.out", stdin=int(i), stdout=PIPE)
    out = proc.communicate()
    print(out)
For the above code, I'm getting an output like this,
(b'32769', None)
(b'32767', None)
(b'32768', None)
Traceback (most recent call last):
  File "/home/zanark/PycharmProjects/TestCase/subprocessEg.py", line 23, in <module>
    proc = Popen("./a.out", stdin=int(i), stdout=PIPE)
ValueError: invalid literal for int() with base 10: '\n'
PS: eg.cpp contains code that increments the number from "Input.txt" by 2.
Pass the string to communicate instead, and open your file as binary (otherwise Python 3 will complain and you'll have to encode your string as bytes):
with open("input.txt", 'rb') as input_file:
    for i in input_file:
        print("Feeding {} to program".format(i.decode("ascii").strip()))
        proc = Popen("./a.out", stdin=PIPE, stdout=PIPE)
        out, err = proc.communicate(input=i)
        print(out)
Also, don't convert the input from the file to an integer; leave it as a string. (I suspect you'll have to filter out blank lines, though.)
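Putting the pieces together, the Pass/Fail comparison from the question could look roughly like the sketch below. run_cases and non_blank are hypothetical helper names, and demo_cmd is a stand-in for ./a.out (a one-line Python program that adds 2 to its input) so that the sketch runs without a compiler. In real use, the inputs and expected lists would come from the lines of Input.txt and Output.txt.

```python
import subprocess
import sys


def non_blank(lines):
    """Drop blank lines, such as the trailing newline that broke int()."""
    return [line for line in lines if line.strip()]


def run_cases(cmd, inputs, expected):
    """Feed each input line to cmd on stdin and compare its stdout.

    Hypothetical helper: cmd is an argument list such as ["./a.out"];
    inputs and expected are lists of strings of equal length.
    Returns one boolean per test case.
    """
    results = []
    for line, want in zip(inputs, expected):
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        # communicate() expects bytes because the pipes are in binary mode.
        out, _ = proc.communicate(input=(line.strip() + "\n").encode("ascii"))
        results.append(out.decode("ascii").strip() == want.strip())
    return results


# Stand-in for ./a.out: reads one number and prints it incremented by 2.
demo_cmd = [sys.executable, "-c", "print(int(input()) + 2)"]
results = run_cases(demo_cmd, non_blank(["32767\n", "\n", "32765\n"]), ["32769", "32767"])
print("Pass" if all(results) else "Fail")
```

Starting a fresh process per test case keeps the cases independent, at the cost of one process launch per input line.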

Python Regex Issue Involving "TypeError: expected string or bytes-like object" [duplicate]

I've been trying to parse a text file and manipulate it with regular expressions.
This is my script:
import re
original_file = open('jokes.txt', 'r+')
original_file.read()
original_file = re.sub("\d+\. ", "", original_file)
How to fix the following error:
Traceback (most recent call last):
  File "filedisplay.py", line 4, in <module>
    original_file = re.sub("\d+\. ", "", original_file)
  File "C:\Python32\lib\re.py", line 167, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
And why am I getting this error?
original_file is a file object; you need to read it to get its contents, which is the string (or buffer) that the regex function requires.
It's also good practice to use with (so that you don't have to remember to close the file), so you might end up with something like this:
import re
with open('jokes.txt', 'r+') as original_file:
    contents = original_file.read()
    new_contents = re.sub(r"\d+\. ", "", contents)
Note that I made the regex a raw string in the code above (the r prefix before the opening quote). That's also good practice, because without it you sometimes have to double-escape backslashes for the pattern to behave as you expect.
You call original_file.read(), but you don't assign that value to anything.
>>> original_file = open('test.txt', 'r+')
>>> original_file.read()
'Hello StackOverflow,\n\nThis is a test!\n\nRegards,\naj8uppal\n'
>>> print original_file
<open file 'test.txt', mode 'r+' at 0x1004bd250>
>>>
Therefore, you need to assign original_file = original_file.read():
import re
original_file = open('jokes.txt', 'r+')
original_file = original_file.read()
original_file = re.sub("\d+\. ", "", original_file)
I would also suggest using with like @Jerry, so that you don't have to close the file yourself.
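Neither answer writes the cleaned text back to the file. A minimal sketch of the full round trip follows, in Python 3 syntax. The function names are hypothetical, and it assumes the goal is to remove numbering such as "1. " wherever it occurs in the text.

```python
import re

# Matches a run of digits, a literal dot, and one space, e.g. "12. ".
NUMBERING = re.compile(r"\d+\. ")


def strip_numbering(text):
    """Return text with every "<digits>. " occurrence removed."""
    return NUMBERING.sub("", text)


def strip_numbering_in_file(path):
    """Read the file, strip the numbering, and write the result back."""
    with open(path, "r") as f:
        contents = f.read()  # a string, which is what re.sub needs
    with open(path, "w") as f:
        f.write(strip_numbering(contents))
```

Reading and writing in two separate with blocks avoids having to seek and truncate a file opened in 'r+' mode.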

python readline from big text file

When I run this:
import os.path
import pyproj
srcProj = pyproj.Proj(proj='longlat', ellps='GRS80', datum='NAD83')
dstProj = pyproj.Proj(proj='longlat', ellps='WGS84', datum='WGS84')
f = file(os.path.join("DISTAL-data", "countries.txt"), "r")
heading = f.readline() # Ignore field names.
with open('C:\Python27\DISTAL-data\geonames_20160222\countries.txt', 'r') as f:
    for line in f.readlines():
        parts = line.rstrip().split("|")
        featureName = parts[1]
        featureClass = parts[2]
        lat = float(parts[9])
        long = float(parts[10])
        if featureClass == "Populated Place":
            long,lat = pyproj.transform(srcProj, dstProj, long, lat)
f.close()
I get this error:
File "C:\Python27\importing world datacountriesfromNAD83 toWGS84.py", line 13, in <module>
    for line in f.readlines():
MemoryError
I downloaded the entire country file dataset from http://geonames.nga.mil/gns/html/namefiles.html.
Please help me resolve this.
readlines() reads the whole file into a list, which for large files creates a large structure in memory. You can try iterating over the file object instead:
f = open('somefilename', 'r')
for line in f:
    do_something(line)
The answer given by Yael is helpful, and I would like to improve on it. A good way to read a file, including a large one, is:
with open(filename) as f:
    for line in f:
        print(line)
I like to use the with statement, which ensures the file will be properly closed.
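Applied to the question's data, line-by-line reading could be sketched as a generator, so that only one line is held in memory at a time. populated_places is a hypothetical helper name; the field positions (1, 2, 9, 10) are taken from the question's code, and the pyproj transformation is left out so the sketch has no external dependency. Python 3 syntax.

```python
def populated_places(path):
    """Yield (name, lat, lon) for each "Populated Place" row.

    Assumes the pipe-separated layout used in the question: name in
    field 1, feature class in field 2, latitude in field 9 and
    longitude in field 10, with a heading line first.
    """
    with open(path, "r") as f:
        next(f, None)  # skip the heading line with the field names
        for line in f:  # the file object yields one line at a time
            parts = line.rstrip("\n").split("|")
            if len(parts) > 10 and parts[2] == "Populated Place":
                yield parts[1], float(parts[9]), float(parts[10])
```

A caller can then loop with for name, lat, lon in populated_places(path) and transform each coordinate pair as it arrives.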

downloading multiple files with urllib.urlretrieve

I'm trying to download multiple files from a website.
The url resembles this: foo.com/foo-1.pdf.
Since I want those files to be stored in a directory of my choice,
I have written the following code:
import os
from urllib import urlretrieve
ext = ".pdf"
for i in range(1,37):
    print "fetching file " + str(i)
    url = "http://foo.com/Lec-" + str(i) + ext
    myPath = "/dir/"
    filename = "Lec-"+str(i)+ext
    fullfilename = os.path.join(myPath, filename)
    x = urlretrieve(url, fullfilename)
EDIT : Complete error message.
Traceback (most recent call last):
  File "scraper.py", line 10, in <module>
    x = urlretrieve(url, fullfilename)
  File "/usr/lib/python2.7/urllib.py", line 94, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 244, in retrieve
    tfp = open(filename, 'wb')
IOError: [Errno 2] No such file or directory: '/dir/Lec-1.pdf'
I'd be grateful if someone could point out where I have gone wrong.
Thanks in advance!
Your code works for me (Python 3.9), so make sure the directory you've specified exists and that your script has access to it. Also, it looks like you are trying to open a file that does not exist, so make sure the file has been downloaded before you open it:
import os
fullfilename = os.path.abspath("d:/DownloadedFiles/Lec-1.pdf")
print(fullfilename)
if os.path.exists(fullfilename):  # open file only if it exists
    with open(fullfilename, 'rb') as file:
        content = file.read()  # read file's content
        print(content[:150])  # print only the first 150 bytes
The output would be as follows:
C:/Users/Administrator/PycharmProjects/Tests/dtest.py
d:\DownloadedFiles\Lec-1.pdf
b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n2346 0 obj <</Linearized 1/L 1916277/O 2349/E 70472/N 160/T 1869308/H [ 536 3620]>>\rendobj\r \r\nxref\r\n2346 12\r\n0000000016 00000 n\r'
Process finished with exit code 0
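The traceback in the question fails at open(filename, 'wb') with Errno 2, which is what happens when the target directory does not exist. A minimal sketch that creates the directory before downloading is shown below, in Python 3 syntax (where urlretrieve lives in urllib.request). plan_downloads and download_all are hypothetical helper names, and the "Lec-<n>" naming follows the question.

```python
import os
from urllib.request import urlretrieve  # Python 3 location of urlretrieve


def plan_downloads(base_url, dest_dir, first, last, ext=".pdf"):
    """Create dest_dir if needed and return (url, local_path) pairs.

    Hypothetical helper; the "Lec-<n>" file naming follows the question.
    """
    # Creating the directory up front avoids Errno 2 when the file is opened.
    os.makedirs(dest_dir, exist_ok=True)
    pairs = []
    for i in range(first, last + 1):
        filename = "Lec-{}{}".format(i, ext)
        pairs.append((base_url + filename, os.path.join(dest_dir, filename)))
    return pairs


def download_all(pairs):
    """Fetch every URL to its local path (needs network access)."""
    for url, path in pairs:
        print("fetching " + url)
        urlretrieve(url, path)
```

For the question's case this would be called as download_all(plan_downloads("http://foo.com/", "/dir/", 1, 36)), provided the script is allowed to create /dir/.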