open pdf without text with python - django

I want open a PDF for a Django views but my PDF has not a text and python returns me a blank PDF.
On each page, this is a scan of a page : link
from django.http import HttpResponse
def views_pdf(request, path):
with open(path) as pdf:
response = HttpResponse(pdf.read(),content_type='application/pdf')
response['Content-Disposition'] = 'inline;elec'
return response
Exception Type: UnicodeDecodeError
Exception Value: 'charmap' codec can't decode byte 0x9d in position 373: character maps to < undefined >
Unicode error hint
The string that could not be encoded/decoded was: � ��`����
How to say at Python that is not a text but a picture ?

By default, Python 3 opens files in text mode, that is, it tries to interpret the contents of a file as text. This is what causes the exception that you see.
Since a PDF file is (generally) a binary file, try opening the file in binary mode. In that case, read() will return a bytes object.
Here's an example (in IPython). First, opening as text:
In [1]: with open('2377_001.pdf') as pdf:
...: data = pdf.read()
...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-d807b6ccea6e> in <module>()
1 with open('2377_001.pdf') as pdf:
----> 2 data = pdf.read()
3
/usr/local/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Next, reading the same file in binary mode:
In [2]: with open('2377_001.pdf', 'rb') as pdf:
...: data = pdf.read()
...:
In [3]: type(data)
Out[3]: bytes
In [4]: len(data)
Out[4]: 45659
In [5]: data[:10]
Out[5]: b'%PDF-1.4\n%'
That solves the first part, how to read the data.
The second part is how to pass it to a HttpResponse. According to the Django documentation:
"Typical usage is to pass the contents of the page, as a string, to the HttpResponse constructor"
So passing bytes might or might not work (I don't have Django installed to test). The Django book says:
"content should be an iterator or a string."
I found the following gist to write binary data:
from django.http import HttpResponse
def django_file_download_view(request):
filepath = '/path/to/file.xlsx'
with open(filepath, 'rb') as fp: # Small fix to read as binary.
data = fp.read()
filename = 'some-filename.xlsx'
response = HttpResponse(mimetype="application/ms-excel")
response['Content-Disposition'] = 'attachment; filename=%s' % filename # force browser to download file
response.write(data)
return response

The problem is probably that the file you are trying to using isn't using the correct type of encoding. You can easily find the encoding of your pdf in most pdf viewers like adobe acrobat (in properties). Once you've found out what encoding it's using you can give it to Python like so:
Replace
with open(path) as pdf:
with :
with open(path, encoding="whatever encoding your pdf is in") as pdf:
Try Latin-1 encoding this often works

Related

Change file encoding for avro/hdfs

My intention is to read files from hdfs and parse them into json by using python. The files had been created by flume (twitter api), so I assume they are streamed avro files by default.
However, it seems avro is expecting UTF-8 but does get some other encoding provided by the file. What can I do to parse those files successfully?
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
tf = avro.schema.Parse(open("/home/user/Data/hdfs/twitter_data/FlumeData.1534507350470", "rt").read())
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-31-8f116b790f79> in <module>()
----> 1 tf = avro.schema.Parse(open("/home/user/Data/hdfs/twitter_data/FlumeData.1534507350470", "rt").read())
/usr/lib64/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 17: invalid continuation byte

Django encoding error when reading from a CSV

When I try to run:
import csv
with open('data.csv', 'rU') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
pgd = Player.objects.get_or_create(
player_name=row['Player'],
team=row['Team'],
position=row['Position']
)
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.`
The particular row in the CSV that causes this error is:
>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}
I've looked at the other similar Stackoverflow threads with the same or similar issues, but most aren't specific to using Sqlite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
File "<console>", line 4, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
row = self.reader.next()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
I suspect you're using Python 2 - open() returns str which are simply byte strings.
The error is telling you that you need to decode your text to Unicode string before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile: # 'U' means Universal line mode and is not necessary
reader = csv.DictReader(csvfile)
for row in reader:
pgd = Player.objects.get_or_create(
player_name=row['Player'].decode('utf-8),
team=row['Team'].decode('utf-8),
position=row['Position'].decode('utf-8)
)
That'll work but it's ugly add decodes everywhere and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings which are the equivalent of Unicode strings in Py2.
To get the same functionality in Python 2, use the io module. This gives you a open() method which has an encoding option. Annoyingly, the Python 2.x CSV module is broken with Unicode, so you need to install a backported version:
pip install backports.csv
To tidy your code and future proof it, do:
import io
from backports import csv
with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
# now every row is automatically decoded from UTF-8
pgd = Player.objects.get_or_create(
player_name=row['Player'],
team=row['Team'],
position=row['Position']
)
Encode Player name in utf-8 using .encode('utf-8') in player name
import csv
with open('data.csv', 'rU') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
pgd = Player.objects.get_or_create(
player_name=row['Player'].encode('utf-8'),
team=row['Team'],
position=row['Position']
)
In Django, decode with latin-1, csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))), it would devour all special characters and all comma exceptions you get in utf-8.

scrapy: Writing unicode in a file writing Pipeline

I have a scrapy Pipeline defined that should write any Item Field crawled by a scraper to text. One of the fields contains HTML code. I'm having issues writing it to file due to the notorious Unicode error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 100: ordinal not in range(128)
Scrapy can write out all of the fields as json in the logfile. Could someone explain what needs to be done to handle the character encoding for writing files? Thanks in advance.
import scrapy
import codecs
class SupportPipeline(object):
def process_item(self, item, spider):
for key, value in item.iteritems():
with codecs.open("%s.%s" % (prefix, key), 'wb', 'utf-8') as f:
# with open("%s.%s" % (prefix, key), 'wb') as f:
f.write(value.encode('utf-8'))
return item

PYPDF watermarking returns error

hi im trying to watermark a pdf fileusing pypdf2 though i get this error i cant figure out what goes wrong.
i get the following error:
Traceback (most recent call last): File "test.py", line 13, in <module>
page.mergePage(watermark.getPage(0)) File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
self._mergePage(page2) File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1651, in _mergePage
page2Content, rename, self.pdf) File "C:Python27\site-packages\PyPDF2\pdf.py", line 1547, in
_contentStreamRename
op = operands[i] KeyError: 0
using python 2.7.6 with pypdf2 1.19 on windows 32bit.
hopefully someone can tell me what i do wrong.
my python file:
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input = PdfFileReader(open("test.pdf", "rb"))
watermark = PdfFileReader(open("watermark.pdf", "rb"))
# print how many pages input1 has:
print("test.pdf has %d pages." % input.getNumPages())
print("watermark.pdf has %d pages." % watermark.getNumPages())
# add page 0 from input, but first add a watermark from another PDF:
page = input.getPage(0)
page.mergePage(watermark.getPage(0))
output.addPage(page)
# finally, write "output" to document-output.pdf
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
Try writing to a StringIO object instead of a disk file. So, replace this:
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
with this:
outputStream = StringIO.StringIO()
output.write(outputStream) #write merged output to the StringIO object
outputStream.close()
If above code works, then you might be having file writing permission issues. For reference, look at the PyPDF working example in my article.
I encountered this error when attempting to use PyPDF2 to merge in a page which had been generated by reportlab, which used an inline image canvas.drawInlineImage(...), which stores the image in the object stream of the PDF. Other PDFs that use a similar technique for images might be affected in the same way -- effectively, the content stream of the PDF has a data object thrown into it where PyPDF2 doesn't expect it.
If you're able to, a solution can be to re-generate the source pdf, but to not use inline content-stream-stored images -- e.g. generate with canvas.drawImage(...) in reportlab.
Here's an issue about this on PyPDF2.

Encoding error when writing in gml file

In one of my previous posts I had a problem with reading and writing strings that are in a language different from English. The problem was in the encoding of my system. ton1c mentioned that writing the strings in a txt is fine and indeed it is! Now I am trying to pass these string in a gml file and I am encountering a problem with the encoding again. Here is the code and the results.
import urllib2
import BeautifulSoup
import networkx as nx
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
data = data.encode("utf-8")
datalist = data.split(',')
G = nx.Graph()
G.add_node( "name", Strings = datalist );
It returns
File "C:\...\name.py", line 23, in <module> nx.write_gml(G, 'Gname')
File "<string>", line 2, in write_gml
File "C:\Python27\lib\site-packages\networkx\utils\decorators.py", line 263, in _open_file
result = func(*new_args, **kwargs)
File "C:\Python27\lib\site-packages\networkx\readwrite\gml.py", line 392, in write_gml
path.write(line.encode('latin-1'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 13: ordinal not in range(128)
Any suggestions? I would like also to mention that in the site of networkx it mentions GML specifications indicate that the file should only use 7bit ASCII text encoding.iso8859-1 (latin-1). (http://networkx.lanl.gov/reference/generated/networkx.readwrite.gml.write_gml.html)
PS: Please any suggestion in Python 2.7 compatibility please.
You just do the following:
import urllib2
import BeautifulSoup
import networkx as nx
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("latin-1")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
data = data[0]['content'].encode("latin-1")
#datalist = data.split(',')
with open("tags.txt", "w") as text_file:
text_file.write("%s"%data)
G = nx.Graph()
G.add_node( "name", Strings = data.decode("latin-1") );
nx.write_gml(G,"test.gml")
graph [
node [
id 0
label "name"
Strings "BBC中文网,主页,国际新闻,中国新闻,台湾新闻,香港新闻,英国新闻,信息,财经,科技,卫生 互动,多媒体,视频,音频,图辑,bbcchinese.com, homepage, world news, China news, uk news, hong kong, taiwan, sci-tech, business, interactive, forum"
]
]