UnicodeEncodeError when writing the selected file - python-2.7

I have tried several solutions from other questions already answered on this subject, but my code still returns the error.
The only purpose of this code is to POS-tag the sentences of a document and dump to a file the sentences that contain more than N occurrences of a particular POS of your choice:
import os
import nlpnet
import codecs

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')

# You could have a function that tags and verifies whether a
# sentence meets the criteria for storage.
def is_worth_saving(text, pos, pos_count):
    # Tagged sentences are lists of tagged words, which in
    # nlpnet are (word, pos) tuples. Tagged texts may contain
    # several sentences.
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] == pos]
    return len(pos_words) >= pos_count

with codecs.open('dataset.txt', encoding='utf8') as original_file:
    with codecs.open('dataset_new.txt', 'w') as output_file:
        for text in original_file:
            # For example, only save sentences with more than 5 verbs
            if is_worth_saving(text, 'V', 5):
                output_file.write(text + os.linesep)
The error produced:
Traceback (most recent call last):
  File "D:/Word Sorter/Classifier.py", line 31, in <module>
    output_file.write(text + os.linesep)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 161-162: ordinal not in range(128)

Have you seen these questions before?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128) and Again: UnicodeEncodeError: ascii codec can't encode
It is exactly the same error as yours, so my guess is that you will need to encode your text using text.encode('utf8').
EDIT:
Try using it here:
output_file.write(text.encode('utf8') + os.linesep)
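Alternatively, note that the root cause is that the output file was opened without an encoding, so the write falls back to ASCII. A minimal sketch of opening the output file with an explicit encoding instead (the file name and sample text are illustrative; io.open behaves the same on Python 2 and 3):

```python
# -*- coding: utf-8 -*-
import io

# Opening the output file with an explicit encoding lets you write
# unicode text directly; each write is encoded to UTF-8 for you,
# so no per-write .encode() call is needed.
with io.open('dataset_new.txt', 'w', encoding='utf8') as output_file:
    output_file.write(u'm\xfcnze\n')  # non-ASCII text writes cleanly

with io.open('dataset_new.txt', encoding='utf8') as check:
    content = check.read()
```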

Related

Pandas convert object column to str - column contains unicode, float etc

I have a pandas DataFrame where the column type shows as object, but when I try to convert it to string with
df['column'] = df['column'].astype('str')
a UnicodeEncodeError gets thrown:
*** UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
My next approach was to handle the encoding part:
df['column'] = filtered_df['column'].apply(lambda x: x.encode('utf-8').strip())
But that gives the following error:
*** AttributeError: 'float' object has no attribute 'encode'
What's the best approach to converting this column to string?
Sample strings from the column:
Thank you :)
Thank You !!!
responsibilities/assigned job.
I had the same problem in Python 2.7 when trying to run a script that was originally intended for Python 3. In Python 2.7, the default str functionality is to encode to ASCII, which will apparently not work with your data. This can be replicated in a simple example:
import pandas as pd
df = pd.DataFrame({'column': ['asdf', u'uh ™ oh', 123]})
df['column'] = df['column'].astype('str')
Results in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 3: ordinal not in range(128)
Instead, you can specify unicode:
df['column'] = df['column'].astype('unicode')
Verify that the number has been converted to a string:
df['column'][2]
This outputs u'123', so it has been converted to a unicode string. The special character ™ has been properly preserved as well.
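For completeness, on Python 3 none of this is needed, because str is already unicode. A small sketch (the sample data is illustrative):

```python
import pandas as pd

# On Python 3, str *is* unicode, so astype(str) handles a mixed
# object column (text with special characters plus numbers) directly.
df = pd.DataFrame({'column': ['asdf', u'uh \u2122 oh', 123]})
df['column'] = df['column'].astype(str)
```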

UnicodeEncodeError: 'ascii' codec can't encode characters in position 321-322: ordinal not in range(128)

I am trying to join column values in an xlsx file using pandas. I am using the code below to do that.
(df.astype(str).groupby('name', as_index=False, sort=False)
   .apply(lambda x: pd.Series({v: ','.join(x[v].unique()) for v in x})))
But I am getting an error like:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 321-322: ordinal not in range(128)
If you only need strings in your DataFrame, you can use the option dtype=unicode in your read_excel call and remove the astype(str).
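A small sketch of the same per-group join done with unicode strings (the column names and data are illustrative, not the asker's real file; on Python 3, astype(str) is already unicode-safe):

```python
import pandas as pd

# Join the unique values of each group into one comma-separated
# unicode string; no ASCII encoding step is ever involved.
df = pd.DataFrame({'name': [u'a', u'a', u'b'],
                   'val': [u'x', u'y\u2019', u'x']})
out = (df.groupby('name', as_index=False, sort=False)
         .agg({'val': lambda s: u','.join(s.unique())}))
```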

Using Python codecs but still getting UnicodeDecodeError

I have a non-English list of rows where each row is a list of strings and ints. I need to write this data to a file, converting all numbers to strings accordingly.
The data contents is the following:
[[u'12', u'as', u'ss', u'ge', u'ge', u'm\xfcnze', u'10.0', u'25.2', u'68.05', 1, 2, 0],
[u'13', u'aas', u'sss', u'tge', u'a', u'mat', u'11.0', u'35.7', u'10.1', 1, 1, 1], ...]
The loop breaks on the first list which contains u'm\xfcnze'.
import codecs

with codecs.open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join([str(r) for r in row]))
        f.write('\n')
The code above fails with the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128).
Trying r.encode('utf-8') if isinstance(r, str) does not solve the issue, so what am I doing wrong?
This should work:
import codecs

with codecs.open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join([unicode(r) for r in row]))
        f.write('\n')
I'm using the unicode() function instead of str(), so the non-ASCII values are never coerced through the ASCII codec.
Note: because the Python 3 string type is already unicode, your code works fine in Python 3 without any modification (no str -> unicode conversion needed).
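To illustrate that last point, here is the same loop run under Python 3, where str() on unicode data never touches the ASCII codec (the file name and sample data are illustrative):

```python
import io

# On Python 3, str is unicode, so str(r) is safe for both the
# unicode strings and the ints in each row.
data = [[u'12', u'as', u'm\xfcnze', 1, 2, 0]]

with io.open('temp.txt', 'w', encoding='utf-8') as f:
    for row in data:
        f.write(u' '.join(str(r) for r in row))
        f.write(u'\n')

with io.open('temp.txt', encoding='utf-8') as f:
    written = f.read()
```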

'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128) SQLAlchemy (Django)

I use an SQLAlchemy query with utf-8 encoding. When I run the query directly on MySQL I get output, but when I run the code from Python I get this error:
'ascii' codec can't decode byte 0xdb in position 942: ordinal not in range(128)
query :
query = """SELECT * FROM (SELECT p.ID AS 'persons_ID', p.FirstName AS 'persons_FirstName', p.LastName AS 'persons_LastName',p.NationalCode AS 'persons_NationalCode', p.CityID AS 'persons_CityID', p.Mobile AS 'persons_Mobile',p.Address AS 'persons_Address', cities_1.ID AS 'cities_1_ID', cities_1.Name AS 'cities_1_Name',cities_1.ParentID AS 'cities_1_ParentID', cities_2.ID AS 'cities_2_ID', cities_2.Name AS 'cities_2_Name',cities_2.ParentID AS 'cities_2_ParentID' , cast(#row := #row + 1 as unsigned) as 'persons_row_number' FROM Persons p LEFT OUTER JOIN cities AS cities_2 ON cities_2.ID = p.CityID LEFT OUTER JOIN cities AS cities_1 ON cities_1.ID = cities_2.ParentID , (select #row := 0) as init WHERE 1=1 AND p.FirstName LIKE N'{}%'""".format('رامین')
MySQL connector charset:
e = create_engine("mysql+pymysql://@localhost/test?charset=utf8")
Do you have any idea how to resolve this?
Thanks,
Python 2 uses bytestrings (ASCII strings) by default, which support only Latin characters. Python 3 uses Unicode strings by default.
As I can see, you use some Arabic script in your query, and therefore you probably get some in the response. The error says that Python can't decode the Arabic characters to ASCII. To handle Arabic (or any other non-Latin) characters you have to use unicode in Python. Note: this has nothing to do with the charset setting you provide, which affects only the database connection.
So your options are:
Switch to Python 3.
Stay as you are, but add from __future__ import unicode_literals at the start of your every module to enable using unicode for strings by default.
Use encode/decode every time you convert between unicode and bytestrings, but this is the worst solution.
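A sketch of the second option: with unicode_literals, plain string literals become unicode even on Python 2, so a query containing non-Latin text is built as unicode rather than an ASCII bytestring (the table and column names here are illustrative, not the real schema):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# Both literals below are unicode on Python 2 thanks to the
# __future__ import; on Python 3 this is already the default.
name = 'رامین'
query = "SELECT * FROM Persons WHERE FirstName LIKE '{}%'".format(name)
```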

pandas dataframe and u'\u2019'

I have a pandas DataFrame (Python 2.7) containing the character u'\u2019', which does not let me export my result as CSV.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 180: ordinal not in range(128)
Is there a way to query the dataframe and substitute this character with another one?
Try using a different encoding when saving to file (the default in pandas for Python 2.x is ascii, that's why you get the error since it can't handle unicode characters):
df.to_csv(path, encoding='utf-8')
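If you would rather substitute the character, as the question asks, a small sketch (the column name 'col' and the data are illustrative; u'\u2019' is the right single quotation mark, replaced here with a plain ASCII apostrophe):

```python
import pandas as pd

# Replace every occurrence of the curly apostrophe in a text column
# before exporting, so even an ASCII-only export would succeed.
df = pd.DataFrame({'col': [u'it\u2019s fine']})
df['col'] = df['col'].str.replace(u'\u2019', u"'")
```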
I did not manage to export the whole file. However, I managed to identify the rows with the characters causing problems and eliminate them:
faulty_rows = []
for i in range(len(outcome)):
    try:
        test = outcome.iloc[i]
        test.to_csv("/Users/john/test/test.csv")
    except UnicodeEncodeError:
        faulty_rows.append(i)
        print i

tocsv = outcome.drop(outcome.index[faulty_rows])
tocsv.to_csv("/Users/john/test/test.csv")