Using scraperwiki for pdf-file on disk - python-2.7

I am trying to get some data out of a pdf document using scraperwiki for Python. It works beautifully if I download the file using urllib2 like so:
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
pages = list(root)
But here comes the tricky part. As I would like to do this for a large number of pdf-files that I have on my disk, I would like to do away with the first line and pass the pdf file directly as an argument. However, if I try
pdfdata = open("filename.pdf","wb")
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
I get the following error
xmldata = scraperwiki.pdftoxml(pdfdata)
File "/usr/local/lib/python2.7/dist-packages/scraperwiki/utils.py", line 44, in pdftoxml
pdffout.write(pdfdata)
TypeError: must be string or buffer, not file
I am guessing that this occurs because I do not open the pdf correctly?
If so, is there a way to open a pdf from disk just like urllib2.urlopen() does?

urllib2.urlopen(...).read() does just that: it reads the contents of the stream returned for the URL you passed as a parameter.
open(), on the other hand, returns a file object. Just as urllib2 needs an open call followed by a read() call, so does a file object.
Change your program to use the following lines:
with open("filename.pdf", "rb") as pdffile:
    pdfdata = pdffile.read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
This will open your pdf then read the contents into a buffer named pdfdata. From there your call to scraperwiki.pdftoxml() will work as expected.
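Since the question mentions a large number of PDF files on disk, the same open-then-read pattern can be wrapped in a loop. The following is only a minimal sketch: the helper name iter_pdf_bytes and the directory name "pdfs" are assumptions made for illustration, and the scraperwiki call is left as a comment.

```python
import glob
import os


def iter_pdf_bytes(directory):
    """Yield (path, raw_bytes) for every .pdf file directly inside directory."""
    pattern = os.path.join(directory, "*.pdf")
    for path in sorted(glob.glob(pattern)):
        # "rb" reads the file as raw bytes and leaves it unmodified
        with open(path, "rb") as pdffile:
            yield path, pdffile.read()


# Hypothetical usage; "pdfs" is a placeholder directory name:
# for path, pdfdata in iter_pdf_bytes("pdfs"):
#     xmldata = scraperwiki.pdftoxml(pdfdata)
```

Each file is closed as soon as its bytes have been read, so the number of open files stays at one regardless of how many PDFs are processed.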

Related

How can I convert a transcribed .wav to txt in full? - Google Speech API

I'm having trouble writing a full speech transcription to a text file. I get what I need, but not the entire text from the audio file. I can see the whole text when I use print() (1 Pic), but only one line of it ends up in the .txt file (2 Pic).
My code is below in case you need more detail. Thank you in advance!
from google.cloud import speech
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'PATH'
client = speech.SpeechClient()
with open('sample.wav', "rb") as audio_file:
    content = audio_file.read()
audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
    # Enable automatic punctuation
    enable_automatic_punctuation=True,
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
    extr = result.alternatives[0].transcript
    print(extr)
    with open("guru9.txt","w+") as f:
        f.write(extr)
        f.close()
In your code, every iteration opens the file, writes to it, and closes it again. Because mode "w+" truncates the file each time it is opened, only the last result survives. Move the opening and closing of the file outside the loop.
myfile = open("guru9.txt","w+")
for result in response.results:
    extr = result.alternatives[0].transcript
    myfile.write(extr)
myfile.close()
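The same idea can also be written with a with block, which closes the file even if an error occurs partway through. This is a rough sketch rather than the answerer's code: the helper name write_transcripts is an assumption, and it takes plain strings so that it does not depend on the Speech API objects.

```python
def write_transcripts(transcripts, path):
    """Write each transcript on its own line, opening the file only once."""
    with open(path, "w") as f:
        for text in transcripts:
            # The newline keeps consecutive results from running together
            f.write(text + "\n")
```

With the response from the question, it could be called as write_transcripts((r.alternatives[0].transcript for r in response.results), "guru9.txt").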

Issue with writing multiple lines into a file in python

I want to save multiple specific links (image URLs) into a txt file (or any file where all the links can be listed one below the other).
I get the links, but the code writes each one over the previous one, so at the end only a single link remains :(. I also want no repeated URLs.
def dlink(self, image_url):
    r = self.session.get(image_url, stream=True)
    with open('Output.txt','w') as f:
        f.write(image_url + '\n')
The issue is most simply that opening a file with mode 'w' truncates any existing file. You should change 'w' to 'a' instead. This will open an existing file for writing, but append instead of truncating.
More fundamentally, the problem may be that you are opening the file over and over in a loop. This is very inefficient. The only time your approach could be really useful is if your program is approaching the OS-imposed limit on the number of open files. If this is not the case, I would recommend putting the loop inside the with block, keeping the mode as 'w' since you now open the file just once, and passing the open file to your dlink function.
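A minimal sketch of that restructuring might look like the following. The function save_links and the parameter names are assumptions made for illustration, dlink is reduced to a plain function that only writes the line, and the set handles the "no repeated URLs" requirement while keeping first-seen order.

```python
def dlink(out_file, image_url):
    """Write a single URL as one line to an already-open file."""
    out_file.write(image_url + "\n")


def save_links(image_urls, path):
    """Open the file once and write every distinct URL on its own line."""
    seen = set()
    # Mode "w" is safe here because the file is opened a single time
    with open(path, "w") as out_file:
        for url in image_urls:
            if url not in seen:  # skip URLs that were already written
                seen.add(url)
                dlink(out_file, url)
```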
Edit
Huge mistake on my part: since dlink is a method that you call several times, opening the file in write mode ('w') or similar overwrites the existing file on every call.
So, if you use 'a' mode instead, you get this behavior:
Opens a file for appending. The file pointer is at the end of the file
if the file exists. That is, the file is in the append mode. If the
file does not exist, it creates a new file for writing.
The other problem is that image_url is a list, so you need to write it line by line:
def dlink(self, image_url):
    r = self.session.get(image_url, stream=True)
    with open('Output.txt','a') as f:
        # set() drops repeated URLs before writing
        for url in list(set(image_url)):
            f.write(url + '\n')
Another way to do it:
your_file = open('Output.txt', 'a')
r = self.session.get(image_url, stream=True)
for url in list(set(image_url)):
    your_file.write("%s\n" % url)
your_file.close()  # don't forget to close it :)
The file open mode is wrong: 'w' mode overwrites the file every time you open it instead of appending to it. Replace it with 'a' mode.
See https://stackoverflow.com/a/23566951/8178794 for more detail.
Opening a file with mode w overwrites the file if it exists; use mode a to append data to an existing file.
Try:
import requests
from os.path import splitext
# use mode='a' to append the result without erasing the file
def dlink(url, filename, mode='w'):
    r = requests.get(url)
    if r.status_code != 200:
        return
    # here the link is valid
    with open(filename, mode) as desc:
        desc.write(url + '\n')  # newline keeps each link on its own line
def dimg(img_url, img_name):
    r = requests.get(img_url, stream=True)
    if r.status_code != 200:
        return
    _, ext = splitext(img_url)
    # write the image to disk chunk by chunk
    with open(img_name + ext, 'wb') as desc:
        for chunk in r:
            desc.write(chunk)
dlink('https://image.flaticon.com/teams/slug/freepik.jpg', 'links.txt')
dlink('https://image.flaticon.com/teams/slug/freepik.jpg', 'links.txt', 'a')
dimg('https://image.flaticon.com/teams/slug/freepik.jpg', 'freepik')

Django how to open file in FileField

I need to open a file saved in a FileField, create a list with the content of the file and pass it to the template. How can I open the file? I tried with open(stocklist.csv_file.url, "wb") but it gave me a "File not found" error. If I do this:
csv_file = stocklist.csv_file.open(mode="rb")
csv_file is None. However, there is a file. If I run print("stocklist.csv_file.url: %s" % stocklist.csv_file.url) I do get
stocklist.csv_file: https://d391vo1.cloudfront.net/csv_pricechart/...ss7.csv
And if I go to the admin, I can download the file. So, how can I open a file saved in a FileField?
.open() opens the file cursor but does not return it, since that depends on your storage (filesystem, S3, FTP...). Once the file is opened, you can use .read() to read its content.
stocklist.csv_file.open(mode="rb")
content = stocklist.csv_file.read()
stocklist.csv_file.close()
If you specifically want to work with a file object, you can use your storage's functionality:
from django.core.files.storage import DefaultStorage
storage = DefaultStorage()
f = storage.open(stocklist.csv_file.name, mode='rb')
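The question also asks for a list built from the file's content. Assuming content holds the raw bytes returned by .read() above and that the file is UTF-8 encoded (both are assumptions, as is the helper name rows_from_csv_bytes), the standard csv module can turn it into a list of rows:

```python
import csv
import io


def rows_from_csv_bytes(content, encoding="utf-8"):
    """Decode raw CSV bytes and return the rows as a list of lists of strings."""
    text = content.decode(encoding)
    # newline="" lets the csv module handle line endings itself
    return list(csv.reader(io.StringIO(text, newline="")))
```

The resulting list can then be passed to the template context like any other Python list.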

Creating and then writing to a file

So I want to read in a text file and then use some of that to write to another file that doesn't exist in the same directory. So for instance if I have a file named text.txt, I want to write a script that reads it and then creates another file, text2.txt which has some of its contents determined by what was in text.txt.
To read the file I'm using the command,
with open(inpath, 'r') as f:
    ...
But then what is the preferred way to create a new file and start writing to it? If I had to guess, I'd think it would be
with open(inpath, 'r') as f:
    outtext = open(outpath, 'w')
    ...
where the variable outpath stores the directory of the file to be written. If I understand all this correctly, if the directory outpath happens to exist, running this script would destroy it or at least append to it. But if it doesn't exist, then Python would create the file. Is that accurate? And is there a better, more elegant way to do this?
I assume inpath and outpath are paths to directories. If so, you cannot do:
with open(inpath, 'r') as f:
    ...
This will throw an IOError exception: open() expects a file path, but you are giving it the path to a directory. The same applies to outpath. Now let's assume the following directory paths:
import os
input_path = '/Users/avi/inputs'
output_path = '/Users/avi/outputs'
Now, to read a file, you could do:
input_file_path = os.path.join(input_path, 'input.txt')
input_file_path will now be /Users/avi/inputs/input.txt
To open it:
with open(input_file_path, 'r') as f:
    ...
Now, to the second question: yes, if the file already exists, Python will overwrite it; if it does not, Python creates a new one. So you can first check whether the file exists and, if it does, write to a different file instead:
output_file_path = os.path.join(output_path, 'output.txt')
if os.path.isfile(output_file_path):
    # file already exists
    # do something else like create another file
    output_file_path = os.path.join(output_path, 'new_output.txt')
# now write to output file
with open(output_file_path, 'w') as f:
    ...
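As for a more elegant way to read one file and write another, a single with statement can manage both files. The following is a sketch under assumptions: the function name copy_matching_lines and the keep callback are illustrative only, standing in for whatever rule decides which parts of the input end up in the output.

```python
def copy_matching_lines(inpath, outpath, keep):
    """Copy the lines of inpath for which keep(line) is true into outpath."""
    # Both files are closed automatically when the with block ends.
    # Mode "w" creates outpath if it is missing and truncates it otherwise.
    with open(inpath, "r") as src, open(outpath, "w") as dst:
        for line in src:
            if keep(line):
                dst.write(line)
```

Note that open() creates the output file but not its parent directory, so the directory part of outpath has to exist already.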

PYPDF watermarking returns error

Hi, I'm trying to watermark a PDF file using PyPDF2, but I get an error and can't figure out what goes wrong.
I get the following error:
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    page.mergePage(watermark.getPage(0))
  File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
    self._mergePage(page2)
  File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1651, in _mergePage
    page2Content, rename, self.pdf)
  File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1547, in _contentStreamRename
    op = operands[i]
KeyError: 0
I'm using Python 2.7.6 with PyPDF2 1.19 on 32-bit Windows.
Hopefully someone can tell me what I'm doing wrong.
My Python file:
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input = PdfFileReader(open("test.pdf", "rb"))
watermark = PdfFileReader(open("watermark.pdf", "rb"))
# print how many pages input1 has:
print("test.pdf has %d pages." % input.getNumPages())
print("watermark.pdf has %d pages." % watermark.getNumPages())
# add page 0 from input, but first add a watermark from another PDF:
page = input.getPage(0)
page.mergePage(watermark.getPage(0))
output.addPage(page)
# finally, write "output" to document-output.pdf
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
Try writing to a StringIO object instead of a disk file. So, replace this:
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
with this:
import StringIO  # Python 2 module, matching the question's Python 2.7
outputStream = StringIO.StringIO()
output.write(outputStream)  # write merged output to the StringIO object
outputStream.close()
If the above code works, then you might have a file write permission issue. For reference, look at the PyPDF working example in my article.
I encountered this error when attempting to use PyPDF2 to merge in a page which had been generated by reportlab, which used an inline image canvas.drawInlineImage(...), which stores the image in the object stream of the PDF. Other PDFs that use a similar technique for images might be affected in the same way -- effectively, the content stream of the PDF has a data object thrown into it where PyPDF2 doesn't expect it.
If you're able to, a solution can be to re-generate the source pdf, but to not use inline content-stream-stored images -- e.g. generate with canvas.drawImage(...) in reportlab.
Here's an issue about this on PyPDF2.