I have a non-English list of rows where each row is a list of strings and ints. I need to write this data to a file, converting all numbers to strings along the way.
The data contents are the following:
[[u'12', u'as', u'ss', u'ge', u'ge', u'm\xfcnze', u'10.0', u'25.2', u'68.05', 1, 2, 0],
[u'13', u'aas', u'sss', u'tge', u'a', u'mat', u'11.0', u'35.7', u'10.1', 1, 1, 1], ...]
The loop breaks on the first list which contains u'm\xfcnze'.
import codecs
with codecs.open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join([str(r) for r in row]))
        f.write('\n')
The code above fails with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)
Trying r.encode('utf-8') if isinstance(r, str) does not solve this issue, so what am I doing wrong?
This should work:
import codecs
with codecs.open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join([unicode(r) for r in row]))
        f.write('\n')
The only change is using the unicode() function instead of str(). In Python 2, str() tries to encode unicode values to ASCII, which fails on characters like u'\xfc', while unicode() converts the ints and leaves the existing unicode strings intact.
Note: because Python 3's str type is already Unicode, your code works fine in Python 3 without any modification (no str -> unicode change needed).
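For reference, a minimal Python 3 sketch of the same loop (assuming data is defined as above); the built-in open() takes the encoding directly, and str() is safe because it is already Unicode:
# Python 3: no codecs needed, str(r) handles ints and text alike
with open("temp.txt", "w", encoding="utf-8") as f:
    for row in data:
        f.write(' '.join(str(r) for r in row))
        f.write('\n')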
I have a simple dataframe with 3 columns.
+------------------+-------------------+-------+
| NM1_PROFILE| CURRENT_DATEVALUE| ID|
+------------------+-------------------+-------+
|XY_12345678 – Main|2019-12-19 00:00:00|myuser1|
+------------------+-------------------+-------+
All I want in the output is a single string consisting of all the values in a dataframe row, separated by a comma or pipe. Although there are many rows in the dataframe, I just need one row for my purpose.
XY_12345678 – Main,2019-12-19 00:00:00,myuser1
I have tried the code below, and it has worked fine for my other dataframes, but for the one above it gives me an error.
df.rdd.map(lambda line: ",".join([str(x) for x in line])).take(1)[0]
It raises an error when it encounters the en dash "–":
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 12: ordinal not in range(128)
I am using Spark 1.6 with Python 2, and I have tried:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
According to the Spark 1.6 documentation, you can use the concat_ws function, which, given a separator and a set of columns, concatenates them into one string. So this should solve your issue:
from pyspark.sql.functions import col, concat_ws
df.select(concat_ws(",", col("NM1_PROFILE"), col("CURRENT_DATEVALUE"), col("ID")).alias("concat")).collect()
Or, if you prefer a more generic way, you can use something like this:
from pyspark.sql.functions import col, concat_ws
cols = [col(column) for column in df.columns]
df.select(concat_ws(",", *cols).alias("concat")).collect()
For more information: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
Hope this helps
I have a pandas dataframe where a column's dtype shows as object, but when I try to convert it to string with
df['column'] = df['column'].astype('str')
a UnicodeEncodeError gets thrown:
*** UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
My next approach was to handle the encoding part:
df['column'] = filtered_df['column'].apply(lambda x: x.encode('utf-8').strip())
But that gives following error:
*** AttributeError: 'float' object has no attribute 'encode'
What's the best approach to convert this column to string?
Sample strings in the column:
Thank you :)
Thank You !!!
responsibilities/assigned job.
I had the same problem in Python 2.7 when trying to run a script that was originally intended for Python 3. In Python 2.7, str() encodes to ASCII by default, which will apparently not work with your data. This can be replicated in a simple example:
import pandas as pd
df = pd.DataFrame({'column': ['asdf', u'uh ™ oh', 123]})
df['column'] = df['column'].astype('str')
Results in:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 3: ordinal not in range(128)
Instead, you can specify unicode:
df['column'] = df['column'].astype('unicode')
Verify that the number has been converted to a string:
df['column'][2]
This outputs u'123', so it has been converted to a unicode string. The special character ™ has been properly preserved as well.
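If the column also mixes in non-string values such as floats (the cause of your AttributeError), a hedged sketch of a helper that normalizes everything to unicode first could look like this (to_unicode is a hypothetical name, and df is the frame from the example above):
def to_unicode(value):
    # Pass unicode through, decode byte strings, and coerce everything else
    if isinstance(value, unicode):
        return value
    if isinstance(value, str):
        return value.decode('utf-8')
    return unicode(value)  # ints, floats, NaN, ...

df['column'] = df['column'].apply(to_unicode)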
I am trying to join column values in an xlsx file using pandas. I am using the code below to do that.
(df.astype(str).groupby('name', as_index=False, sort=False)
   .apply(lambda x: pd.Series({v: ','.join(x[v].unique()) for v in x})))
But I am getting an error like:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 321-322: ordinal not in range(128)
If you only need strings in your DataFrame, you can pass dtype=unicode to your read_excel call and remove the astype(str).
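A minimal sketch of that idea, assuming a pandas version whose read_excel supports the dtype argument (data.xlsx is a placeholder filename):
import pandas as pd

# Read every column as unicode text up front, so no later str coercion is needed
df = pd.read_excel('data.xlsx', dtype=unicode)
result = (df.groupby('name', as_index=False, sort=False)
            .apply(lambda x: pd.Series({v: ','.join(x[v].unique()) for v in x})))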
I have made several attempts based on other questions already answered on this subject, but my code still returns the error.
The only purpose of this code is to POS-tag the sentences of a document and dump to a file the sentences that contain at least N occurrences of a particular POS of your choice:
import os
import nlpnet
import codecs

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')

# You could have a function that tags and verifies whether a
# sentence meets the criteria for storage.
def is_worth_saving(text, pos, pos_count):
    # Tagged sentences are lists of tagged words, which in
    # nlpnet are (word, pos) tuples. Tagged texts may contain
    # several sentences.
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] == pos]
    return len(pos_words) >= pos_count

with codecs.open('dataset.txt', encoding='utf8') as original_file:
    with codecs.open('dataset_new.txt', 'w') as output_file:
        for text in original_file:
            # For example, only save sentences with at least 5 verbs in them
            if is_worth_saving(text, 'V', 5):
                output_file.write(text + os.linesep)
The error produced:
Traceback (most recent call last):
File "D:/Word Sorter/Classifier.py", line 31, in <module>
output_file.write(text + os.linesep)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 161-162: ordinal not in range(128)
Have you seen these questions before?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128) and Again: UnicodeEncodeError: ascii codec can't encode
It is exactly the same as your error. So my guess is that you will need to encode your text using text.encode('utf8').
EDIT:
Try using it here:
output_file.write(text.encode('utf8') + os.linesep)
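Alternatively, a sketch of the same fix at the file level: open the output file with an explicit encoding, so the writer encodes the unicode text for you:
with codecs.open('dataset.txt', encoding='utf8') as original_file:
    with codecs.open('dataset_new.txt', 'w', encoding='utf8') as output_file:
        for text in original_file:
            if is_worth_saving(text, 'V', 5):
                output_file.write(text + os.linesep)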
I have a pandas dataframe (Python 2.7) containing a u'\u2019' that does not let me export my result as CSV.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 180: ordinal not in range(128)
Is there a way to query the dataframe and substitute this character with another one?
Try using a different encoding when saving to file (the default in pandas for Python 2.x is ascii, that's why you get the error since it can't handle unicode characters):
df.to_csv(path, encoding='utf-8')
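To answer the substitution part of the question, a hedged sketch that replaces the right single quotation mark with a plain apostrophe before exporting ('column' stands in for whichever column holds the character):
# Replace u'\u2019' (right single quote) with an ASCII apostrophe, then export
df['column'] = df['column'].str.replace(u'\u2019', u"'")
df.to_csv(path, encoding='utf-8')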
I did not manage to export the whole file. However, I managed to identify the rows with the characters causing problems and drop them:
faulty_rows = []
for i in range(len(outcome)):
    try:
        test = outcome.iloc[i]
        test.to_csv("/Users/john/test/test.csv")
    except:
        faulty_rows.append(i)
        print i

tocsv = outcome.drop(outcome.index[faulty_rows])
tocsv.to_csv("/Users/john/test/test.csv")