Change file encoding for avro/hdfs - hdfs

My intention is to read files from hdfs and parse them into json by using python. The files had been created by flume (twitter api), so I assume they are streamed avro files by default.
However, it seems avro is expecting UTF-8 but does get some other encoding provided by the file. What can I do to parse those files successfully?
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
tf = avro.schema.Parse(open("/home/user/Data/hdfs/twitter_data/FlumeData.1534507350470", "rt").read())
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-31-8f116b790f79> in <module>()
----> 1 tf = avro.schema.Parse(open("/home/user/Data/hdfs/twitter_data/FlumeData.1534507350470", "rt").read())
/usr/lib64/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 17: invalid continuation byte

Related

open pdf without text with python

I want open a PDF for a Django views but my PDF has not a text and python returns me a blank PDF.
On each page, this is a scan of a page : link
from django.http import HttpResponse
def views_pdf(request, path):
with open(path) as pdf:
response = HttpResponse(pdf.read(),content_type='application/pdf')
response['Content-Disposition'] = 'inline;elec'
return response
Exception Type: UnicodeDecodeError
Exception Value: 'charmap' codec can't decode byte 0x9d in position 373: character maps to < undefined >
Unicode error hint
The string that could not be encoded/decoded was: � ��`����
How to say at Python that is not a text but a picture ?
By default, Python 3 opens files in text mode, that is, it tries to interpret the contents of a file as text. This is what causes the exception that you see.
Since a PDF file is (generally) a binary file, try opening the file in binary mode. In that case, read() will return a bytes object.
Here's an example (in IPython). First, opening as text:
In [1]: with open('2377_001.pdf') as pdf:
...: data = pdf.read()
...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-d807b6ccea6e> in <module>()
1 with open('2377_001.pdf') as pdf:
----> 2 data = pdf.read()
3
/usr/local/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Next, reading the same file in binary mode:
In [2]: with open('2377_001.pdf', 'rb') as pdf:
...: data = pdf.read()
...:
In [3]: type(data)
Out[3]: bytes
In [4]: len(data)
Out[4]: 45659
In [5]: data[:10]
Out[5]: b'%PDF-1.4\n%'
That solves the first part, how to read the data.
The second part is how to pass it to a HttpResponse. According to the Django documentation:
"Typical usage is to pass the contents of the page, as a string, to the HttpResponse constructor"
So passing bytes might or might not work (I don't have Django installed to test). The Django book says:
"content should be an iterator or a string."
I found the following gist to write binary data:
from django.http import HttpResponse
def django_file_download_view(request):
filepath = '/path/to/file.xlsx'
with open(filepath, 'rb') as fp: # Small fix to read as binary.
data = fp.read()
filename = 'some-filename.xlsx'
response = HttpResponse(mimetype="application/ms-excel")
response['Content-Disposition'] = 'attachment; filename=%s' % filename # force browser to download file
response.write(data)
return response
The problem is probably that the file you are trying to using isn't using the correct type of encoding. You can easily find the encoding of your pdf in most pdf viewers like adobe acrobat (in properties). Once you've found out what encoding it's using you can give it to Python like so:
Replace
with open(path) as pdf:
with :
with open(path, encoding="whatever encoding your pdf is in") as pdf:
Try Latin-1 encoding this often works

UnicodeDecodeError: 'utf8' codec can't decode byte 0xaf in position 3: invalid start byte in python 2.7

using windows10 python 2.7
my code for decryption
def decrypt(self, enc):
enc = b64decode(enc)
iv = enc[:16]
cipher = AES.new(self.key, AES.MODE_CBC, iv)
print cipher,"======"
dec = cipher.decrypt(enc[16:])
#print dec,"========",dec
unp = unpad(dec)
print unp,"=========","=fdkjfsdklfsdjndjdjk"
decode = unp.decode('utf8')
#decode = unp.decode('utf8')
print decode
# unpad(cipher.decrypt(enc[16:])).decode('utf8')
return decode
while decrypting the encrypted response cipher.decrypt(enc[16:]) line gives me below output. But actually It should be the XML format.
)^»3(Fm╠¡Oå┤╖¢iOÑ>s▌B¿▌╥≥┐Éj6╬░¢√(å¥ 2?J≤ôGOL═\¥°t╬╚ΓÜ▐╝Φ÷═AQw≥[&nΣ±ƒ∩(╩ûGN~[3bgrHPÜ4%╖H⌡▄wÅ|■Çq≥½÷σHñxìdºwë±!│▐íWÇÿΘ╦σ╖è#X▓┤2ÿ ┘╟ƒΣ°Y░çNßæÅαb3f«─O(Wo9┐A╕t£╧{K [X┴┬ÜHΘ⌠X4┬Æ≡~╠h3ε┘σmÉfú.Fú╜₧c!_╒▐wα²A/╒|─sY%=⌐▒Yö╕[╞ε░::tA┴₧µ≤²∙C─A█₧╕╧τ╙x≤rƒú░uú█å┬-╤`╡f╕^∞tΦ½q╗&╪─╘¥&┐Σ₧▌(╙┌JüñÇäQ¥/*ó▐H!C┬+δà\Bah╘áÆXu╥C█│¼)ë╩╓*E(÷·├à√¿╨╧1Θ·0≈º²║Ås┬xOò}a╪╔╫HÜq┬gqÅÖ⌐4~v╖·9╥Ü$wçZ▌╗┬? /Zj12^}&t$F=SBKhöåε è╝o╪█º8fìîé╫=«·gO:Z╢≡2╔K«Θ uè/╩ {⌐Åwwε^α┼µk4┘Ñ╧:ƒ16║╞ⁿB°¢üdó?eB┼P┌L_90]\5W╥µA⌐
#Mq╤ìⁿ²ç≥Θ·▓F₧▀) ç#ë╒╖às2╡}πL╕╨60ä┌ù6▒.rn╔jⁿR¢∙µIëÉ╝µè}c≈σß_αäcª/╤"lK*└qX2H öφq#â½æΘjÄ% é6#üY█▓aFßα█÷I║n+⌡▄Ä!jTÄ√∩yr¥d"╛¬z√ⁿµº½êYⁿπ¬2[╕¿≡ÿ │Uv?{τæτ°QÜĵ╨íkUFπ╚BπÆ! Hiåƒ╒£αì≥Æεtr█[╤àÆ█oíΩ("┤╞åMÜò╝D3╬¿VτΩr▓ÜÆÿ$┌)⌠≡\~╩▀Rr≡y£₧≤║L>╙ ╘µv9ÿæ├#B≡µ£╕Ew╗yÿtXeY.αÑsú Y±£∩=yy¥óüΣÆF╧╦á─} Oƒ≥-9[≤¢fúΣe3&Öÿ░ìç·ntÄO
l∙m¥\╞&KêëR»s╔E2╨ª│OV≥░m═╬2┬₧ú(ûöz¢¼╣\≤5nqò+╝±Äm{Gσ╝ROφNµàg╛RV╨;Lδa ,é/ⁿY╜|┤ñ╔÷πvⁿ╞W▓π}Rå#h$*πAò¼2╝CÅk*l"h╕≥aÆhæt)9▐░╝.]B}-╢└∩Iσw┬╚D&5≡▒²`WJ╔╫⌡K1∩ fú~A▌c▄mÑ┴?ôQ╩ƒⁿ|╨{ç▒·ΘB╡Φτ▌⌠─╘q?nⁿC/v>σ°┬#'L┌ 0Kè£
╩[Érekx«wë,\¥─K\a╡·┐PDIF╩l╤YH╞F$c6≈G¡Üc^r=pbiµΦ┘±ÿ▓zΦ¿0░ì┐á7┌o■«-ⁿ#,
While decoding i.e at line unp.decode('utf8') gives me the following error
Traceback (most recent call last):
File "nic_dycrypt_encrypt.py", line 99, in
print('Ciphertext:', AESCipher(key).decrypt(ciphertext))
File "nic_dycrypt_encrypt.py", line 86, in decrypt
decode = unp.decode('utf8').strip()
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xaf in position 3: invalid start byte
Please any one help me to know what is that format and why the error is coming and how to resolve
Simply put not all bytes and/or byte sequences map to unicode characters. In fact most byte sequences do not have a UTF-8 character mapping.
The common solution is to convert binary to an encoding that can handle all byte values, the most common are Base64 and Hexadecimal.

Django encoding error when reading from a CSV

When I try to run:
import csv
with open('data.csv', 'rU') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
pgd = Player.objects.get_or_create(
player_name=row['Player'],
team=row['Team'],
position=row['Position']
)
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.`
The particular row in the CSV that causes this error is:
>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}
I've looked at the other similar Stackoverflow threads with the same or similar issues, but most aren't specific to using Sqlite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
File "<console>", line 4, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
row = self.reader.next()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
I suspect you're using Python 2 - open() returns str which are simply byte strings.
The error is telling you that you need to decode your text to Unicode string before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile: # 'U' means Universal line mode and is not necessary
reader = csv.DictReader(csvfile)
for row in reader:
pgd = Player.objects.get_or_create(
player_name=row['Player'].decode('utf-8),
team=row['Team'].decode('utf-8),
position=row['Position'].decode('utf-8)
)
That'll work but it's ugly add decodes everywhere and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings which are the equivalent of Unicode strings in Py2.
To get the same functionality in Python 2, use the io module. This gives you a open() method which has an encoding option. Annoyingly, the Python 2.x CSV module is broken with Unicode, so you need to install a backported version:
pip install backports.csv
To tidy your code and future proof it, do:
import io
from backports import csv
with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
# now every row is automatically decoded from UTF-8
pgd = Player.objects.get_or_create(
player_name=row['Player'],
team=row['Team'],
position=row['Position']
)
Encode Player name in utf-8 using .encode('utf-8') in player name
import csv
with open('data.csv', 'rU') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
pgd = Player.objects.get_or_create(
player_name=row['Player'].encode('utf-8'),
team=row['Team'],
position=row['Position']
)
In Django, decode with latin-1, csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))), it would devour all special characters and all comma exceptions you get in utf-8.

scrapy: Writing unicode in a file writing Pipeline

I have a scrapy Pipeline defined that should write any Item Field crawled by a scraper to text. One of the fields contains HTML code. I'm having issues writing it to file due to the notorious Unicode error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 100: ordinal not in range(128)
Scrapy can write out all of the fields as json in the logfile. Could someone explain what needs to be done to handle the character encoding for writing files? Thanks in advance.
import scrapy
import codecs
class SupportPipeline(object):
def process_item(self, item, spider):
for key, value in item.iteritems():
with codecs.open("%s.%s" % (prefix, key), 'wb', 'utf-8') as f:
# with open("%s.%s" % (prefix, key), 'wb') as f:
f.write(value.encode('utf-8'))
return item

Python: POSTing binary data gives UnicodeDecodeError or Ascii decode error

When POSTing binary data using urllib2 or urllib3, or httplib2, I receive the error UnicodeDecodeError: 'utf8' codec can't decode or UnicodeDecodeError: 'ascii' codec can't decode... depending on whether the Python script is in UniCode or ASCII mode.
I first thought that the library was the issue, so I tried different libraries but that didn't solve the problem.
End of the stack trace:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 895, in _send_output
msg += message_body
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 627: invalid continuation byte
The problem, as noted in comments to Python bug 11898 is that the url string became tagged at some point as either a Unicode or Ascii string.
Then, when the httplib library is creating the byte string for the entire HTTP/S message, and the line
msg += message_body
is executed, Python tries to convert message_body (which contains binary data) to either Ascii or Unicode. In either case, the conversion fails.
Solution
use str() when making any modifications to the url. In my case:
url = baseUrl + "/envelopes" # throws UnicodeDecodeError
url = str(baseUrl + "/envelopes") # works great
If that is not enough, check your other strings to ensure that they haven't been tagged as Unicode or Ascii.