scrapy: Writing unicode in a file writing Pipeline - python-2.7

I have a scrapy Pipeline defined that should write every Item field crawled by the spider to a text file. One of the fields contains HTML code. I'm having trouble writing it to file due to the notorious Unicode error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 100: ordinal not in range(128)
Scrapy can write out all of the fields as json in the logfile. Could someone explain what needs to be done to handle the character encoding for writing files? Thanks in advance.
import scrapy
import codecs

class SupportPipeline(object):
    def process_item(self, item, spider):
        for key, value in item.iteritems():
            with codecs.open("%s.%s" % (prefix, key), 'wb', 'utf-8') as f:
            # with open("%s.%s" % (prefix, key), 'wb') as f:
                f.write(value.encode('utf-8'))
        return item
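For reference, a minimal sketch of a fix (a plain function stands in for the full pipeline, and `prefix` plus the sample item are made up for illustration): a file opened with an explicit encoding via `io.open()` (or `codecs.open()`) expects unicode text, so the value should be written directly. Calling `.encode('utf-8')` first hands the codec a byte string, which Python 2 then implicitly decodes as ASCII, producing exactly the error above.

```python
# -*- coding: utf-8 -*-
import io

def write_item_fields(item, prefix):
    # Sketch only: `prefix` and the plain-dict item are stand-ins for the
    # real pipeline state. io.open() (Python 2 and 3) encodes unicode text
    # on write, so the value must NOT be pre-encoded with .encode('utf-8').
    for key, value in item.items():
        with io.open("%s.%s" % (prefix, key), 'w', encoding='utf-8') as f:
            f.write(value)

write_item_fields({u'body': u'caf\xe9\xa0<p>html</p>'}, 'out')
with io.open('out.body', 'r', encoding='utf-8') as f:
    print(f.read() == u'caf\xe9\xa0<p>html</p>')  # True
```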

Related

open pdf without text with python

I want to open a PDF in a Django view, but my PDF has no text and Python returns a blank PDF.
Each page is a scan of a page: link
from django.http import HttpResponse

def views_pdf(request, path):
    with open(path) as pdf:
        response = HttpResponse(pdf.read(), content_type='application/pdf')
        response['Content-Disposition'] = 'inline;elec'
        return response
Exception Type: UnicodeDecodeError
Exception Value: 'charmap' codec can't decode byte 0x9d in position 373: character maps to <undefined>
Unicode error hint
The string that could not be encoded/decoded was: � ��`����
How do I tell Python that this is not text but a picture?
By default, Python 3 opens files in text mode, that is, it tries to interpret the contents of a file as text. This is what causes the exception that you see.
Since a PDF file is (generally) a binary file, try opening the file in binary mode. In that case, read() will return a bytes object.
Here's an example (in IPython). First, opening as text:
In [1]: with open('2377_001.pdf') as pdf:
...: data = pdf.read()
...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-d807b6ccea6e> in <module>()
1 with open('2377_001.pdf') as pdf:
----> 2 data = pdf.read()
3
/usr/local/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Next, reading the same file in binary mode:
In [2]: with open('2377_001.pdf', 'rb') as pdf:
...: data = pdf.read()
...:
In [3]: type(data)
Out[3]: bytes
In [4]: len(data)
Out[4]: 45659
In [5]: data[:10]
Out[5]: b'%PDF-1.4\n%'
That solves the first part, how to read the data.
The second part is how to pass it to a HttpResponse. According to the Django documentation:
"Typical usage is to pass the contents of the page, as a string, to the HttpResponse constructor"
So passing bytes might or might not work (I don't have Django installed to test). The Django book says:
"content should be an iterator or a string."
I found the following gist to write binary data:
from django.http import HttpResponse

def django_file_download_view(request):
    filepath = '/path/to/file.xlsx'
    with open(filepath, 'rb') as fp:  # Small fix to read as binary.
        data = fp.read()
    filename = 'some-filename.xlsx'
    response = HttpResponse(mimetype="application/ms-excel")
    response['Content-Disposition'] = 'attachment; filename=%s' % filename  # force browser to download file
    response.write(data)
    return response
The problem is probably that the file you are trying to use isn't in the encoding you expect. You can easily find the encoding of your PDF in most PDF viewers, like Adobe Acrobat (under Properties). Once you've found out what encoding it's using, you can give it to Python like so:
Replace
with open(path) as pdf:
with:
with open(path, encoding="whatever encoding your pdf is in") as pdf:
Try Latin-1 encoding; this often works.

Django encoding error when reading from a CSV

When I try to run:
import csv

with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Most of my data gets created in the database, except for one particular row. When my script reaches the row, I receive the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory = str).
It is highly recommended that you instead just switch your application to Unicode strings.
The particular row in the CSV that causes this error is:
>>> row
{'FR\xed\x8aD\xed\x8aRIC.ST-DENIS', 'BOS', 'G'}
I've looked at the other similar Stackoverflow threads with the same or similar issues, but most aren't specific to using Sqlite with Django. Any advice?
If it matters, I'm running the script by going into the Django shell by calling python manage.py shell, and copy-pasting it in, as opposed to just calling the script from the command line.
This is the stacktrace I get:
Traceback (most recent call last):
File "<console>", line 4, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 108, in next
row = self.reader.next()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 302, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 1674: invalid continuation byte
EDIT: I decided to just manually import this entry into my database, rather than try to read it from my CSV, based on Alastair McCormack's feedback
Based on the output from your question, it looks like the person who made the CSV mojibaked it - it doesn't seem to represent FRÉDÉRIC.ST-DENIS. You can try using windows-1252 instead of utf-8 but I think you'll end up with FRíŠDíŠRIC.ST-DENIS in your database.
I suspect you're using Python 2 - open() returns str which are simply byte strings.
The error is telling you that you need to decode your text to Unicode string before use.
The simplest method is to decode each cell:
with open('data.csv', 'r') as csvfile:  # 'U' means Universal line mode and is not necessary
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].decode('utf-8'),
            team=row['Team'].decode('utf-8'),
            position=row['Position'].decode('utf-8')
        )
That'll work, but it's ugly to add decodes everywhere, and it won't work in Python 3. Python 3 improves things by opening files in text mode and returning Python 3 strings, which are the equivalent of Unicode strings in Py2.
To get the same functionality in Python 2, use the io module. This gives you an open() function which has an encoding option. Annoyingly, the Python 2.x csv module is broken with Unicode, so you need to install a backported version:
pip install backports.csv
To tidy your code and future proof it, do:
import io
from backports import csv

with io.open('data.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # now every row is automatically decoded from UTF-8
        pgd = Player.objects.get_or_create(
            player_name=row['Player'],
            team=row['Team'],
            position=row['Position']
        )
Encode the player name as UTF-8 using .encode('utf-8'):
import csv

with open('data.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        pgd = Player.objects.get_or_create(
            player_name=row['Player'].encode('utf-8'),
            team=row['Team'],
            position=row['Position']
        )
In Django, decode with latin-1: csv.DictReader(io.StringIO(csv_file.read().decode('latin-1'))). It will devour all the special characters and comma exceptions you'd get with utf-8.
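A minimal sketch of that latin-1 approach outside Django (the byte string below is a made-up stand-in for csv_file.read()): latin-1 maps every one of the 256 byte values to a character, so the decode can never raise, although bytes that were really UTF-8 will come out as mojibake.

```python
import csv
import io

# Stand-in for the uploaded file's raw bytes; '\xcd' is 'Í' in latin-1.
csv_bytes = b'Player,Team,Position\nFR\xcdD\xcdRIC.ST-DENIS,BOS,G\n'

# latin-1 decoding cannot fail, so the DictReader always gets a clean string.
reader = csv.DictReader(io.StringIO(csv_bytes.decode('latin-1')))
rows = list(reader)
print(rows[0]['Player'])  # FRÍDÍRIC.ST-DENIS
```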

Return tweets based on hashtag and save them to file.txt

I would like to retrieve tweets on a specific date based on their hashtag. For this purpose I'm using tweepy and the following code:
results = api.search('#brexit OR #EUref', since="2016-06-24",
                     until="2016-06-30", monitor_rate_limit=True, wait_on_rate_limit=True)
with open('24june_bx.txt', 'w') as f:
    for tweet in results:
        try:
            f.write('{}\n'.format(tweet.text.decode('utf-8')))
        except BaseException as e:
            print 'ascii codec can\'t encode characters'
            continue
As you can see, I'm trying to get all the tweets with the hashtag '#brexit' or '#EUref' from the day after the vote, and store them in the file '24june_bx.txt'.
It kind of works... but in the file I only get about 10 tweets. The terminal also reports 7 times the exception and prints 'ascii codec...'.
What do you think may be the problem?
Sorry for the noobish question.
Many thanks.
You can use Tweepy's Cursor in conjunction with api.search to get as many tweets as you want.
def search_tweets_from_twitter_home(query, max_tweets, from_date, to_date):
    """search using twitter search_home. "result_type=mixed" means both
    'recent' & 'popular' tweets will be returned in search results.
    returns the generator (for memory efficiency)
    """
    searched_tweets = (status._json for status in tweepy.Cursor(api.search,
                       q=query, count=300, since=from_date, until=to_date,
                       result_type="mixed", lang="en").items(max_tweets))
    return searched_tweets
This will return as many tweets as you mention in max_tweets, assuming that that many tweets are available to return.
You can then iterate over the generator and write it to a file.
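For example (the stand-in generator below replaces the live tweepy results, since a real API connection is needed; each element is assumed to be a status JSON dict with a 'text' key, as the function above yields):

```python
# -*- coding: utf-8 -*-
import io

def write_tweets(tweets, path):
    # Write each tweet's text on its own line; io.open with an explicit
    # encoding avoids the ascii-codec errors from the question.
    with io.open(path, 'w', encoding='utf-8') as f:
        for tweet in tweets:
            f.write(u'%s\n' % tweet['text'])

# Hypothetical stand-in for search_tweets_from_twitter_home(...) output.
fake_results = ({'text': t} for t in [u'#brexit: result is in', u'#EUref \u2714'])
write_tweets(fake_results, '24june_bx.txt')
```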
Use the io lib, setting the encoding to utf-8 to handle your encoding errors:
import io

with io.open('24june_bx.txt', 'w', encoding="utf-8") as f:
    for tweet in results:
        try:
            f.write(u'{}\n'.format(tweet.text))
        except UnicodeEncodeError as e:
            print(e)
If you use the regular open you need to encode to utf-8 as you already have a unicode string:
with open('24june_bx.txt', 'w') as f:
    for tweet in results:
        try:
            f.write('{}\n'.format(tweet.text.encode("utf-8")))
        except UnicodeEncodeError as e:
            print(e)
'#brexit OR #EUref'
I think using this as the search query will return tweets which contain that particular string. Try using only '#brexit' and '#EUref' and later concatenating the results.
Try adding
# -*- coding: utf-8 -*-
at the first line of your script

Python 2.7 and Textblob - TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'list'>

Update: Issue resolved. (see comment section below.) Ultimately, the following two lines were required to transform my .csv to unicode and utilize TextBlob: row = [cell.decode('utf-8') for cell in row], and text = ' '.join(row).
Original question:
I am trying to use a Python library called Textblob to analyze text from a .csv file. Error I receive when I call Textblob in my code is:
Traceback (most recent call last):
  File "C:\Users\Marcus\Documents\Blog\Python\Scripts\Brooks\textblob_sentiment.py", line 30, in <module>
    blob = TextBlob(row)
  File "C:\Python27\lib\site-packages\textblob\blob.py", line 344, in __init__
    'must be a string, not {0}'.format(type(text)))
TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'list'>
My code is:
#from __future__ import division, unicode_literals  # (This was recommended for Python 2.x, but didn't help in my case.)
# -*- coding: utf-8 -*-
import csv
from textblob import TextBlob

with open(u'items.csv', 'rb') as scrape_file:
    reader = csv.reader(scrape_file, delimiter=',', quotechar='"')
    for row in reader:
        row = [unicode(cell, 'utf-8') for cell in row]
        print row
        blob = TextBlob(row)
        print type(blob)
I have been working through UTF/unicode issues. I'd originally had a different subject which I posed to this thread. (Since my code and the error have changed, I'm posting to a new thread.) Print statements indicate that the variable "row" is of type=str, which I thought indicated that the reader object had been transformed as required by Textblob. The source .csv file is saved as UTF-8. Can anyone provide feedback as to how I can get unblocked on this, and the flaws in my code?
Thanks so much for the help.
So maybe you can make the change below:
row = str([cell.encode('utf-8') for cell in row])
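For completeness, the two lines from the update at the top can be sketched in isolation (the sample byte cells are made up): in Python 2 the csv module yields raw byte strings, and the same .decode('utf-8') call also works on bytes objects in Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical row as the Python 2 csv module would deliver it: raw
# UTF-8 byte strings ('\xc3\xa9' is 'é' encoded as UTF-8).
row = [b'caf\xc3\xa9', b'BOS', b'G']

row = [cell.decode('utf-8') for cell in row]  # bytes -> unicode cells
text = u' '.join(row)                         # one string for TextBlob(text)
print(text)  # café BOS G
```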

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

I know there are existing questions with this title, but they are different from mine. So here's my problem. I use a context processor to display the user's name. It works, but my Sentry detected an error yesterday.
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Here is my code:
def display_name(request):
    try:
        name = "{0} {1}".format(request.user.first_name, request.user.last_name)
        name = name.strip()
        if not name:
            name = request.user.username
    except AttributeError:
        name = None
    return {'display_name': name,}
What's the cause of this? Is it the characters a user entered for their name?
It's basically a user input problem.
Text encodings are a whole "thing" and hard to get into, but in a nut shell, a user entered a Unicode character that can't easily be mapped to an ASCII character.
You can fix this by changing this:
name = "{0} {1}".format(request.user.first_name, request.user.last_name)
To this:
name = u"{0} {1}".format(request.user.first_name, request.user.last_name)
This tells Python to treat the string as a unicode string (which supports all the same operations as an ASCII string).
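The failure mode can be reproduced outside Django (a sketch; the names are made up). In Python 2, formatting unicode arguments into a byte-string template forces an implicit ASCII encode of those arguments, which is the same operation shown failing explicitly here:

```python
# -*- coding: utf-8 -*-
name = u'\xc9ric'            # 'Éric' - a name a user might enter

# What Python 2 does implicitly when a str template meets unicode args:
try:
    name.encode('ascii')
except UnicodeEncodeError as e:
    print(e)                 # 'ascii' codec can't encode character ...

# With a unicode template there is no implicit ASCII step:
display = u"{0} {1}".format(name, u'M\xfcller')
print(display)  # Éric Müller
```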