Reading and writing UTF-8 from file - python-2.7

I have some text encoded in UTF-8: 'Before – after.' It was fetched from the web, and the '–' character is the issue. If I print it directly from the interactive prompt, using copy and paste, it works:
>>> text = 'Before – after.'
>>> print text
Before – after.
But if you save to a text file and try to print:
>>> for line in open('file.txt', 'r'):
...     print line
Before û after.
I'm pretty sure this is some sort of UTF-8 encode/decode error, but it is eluding me. I have tried decoding and re-encoding, but that is not it either:
>>> for line in open('file.txt', 'r'):
...     print line.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte
>>> for line in open('file.txt', 'r'):
...     print line.encode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 7: invalid start byte

It's happening because the byte 0x96 is not valid UTF-8: it is the en dash ('–') in the Windows-1252 (cp1252) encoding, so the file was most likely saved with that codec rather than UTF-8. Decode the file as cp1252, or strip the non-ASCII bytes and print only the ASCII values.
Take a look at this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
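Since 0x96 is the en dash in Windows-1252, here is a sketch of reading such a file with the right codec. The file name is taken from the question; `io.open` accepts an `encoding` argument in both Python 2.7 and Python 3:

```python
import io

# Simulate the file from the question: saved by a Windows editor as cp1252,
# so the en dash is the single byte 0x96 (not valid as a UTF-8 start byte).
with io.open('file.txt', 'wb') as f:
    f.write(b'Before \x96 after.')

# Decoding with cp1252 recovers the en dash; decoding as UTF-8 would raise
# the UnicodeDecodeError shown in the question.
with io.open('file.txt', 'r', encoding='cp1252') as f:
    for line in f:
        print(line)
```

If you control how the file is written, the cleaner fix is to save it as UTF-8 in the first place and read it back with `encoding='utf-8'`.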

Related

python .format() can't interpolate accented letters - Python 2.7

I'm trying to interpolate strings whose accented letters are stored correctly in a database. When I retrieve them, I get an error:
'<html><div>{ragioneSociale}{iva}{sdi}{cuu}{indirizzo}{metodoDiPagamento}{iban_bic}</div></html>'.format(
ragioneSociale=generaleViewRes.getString('ragioneSociale'),
iva=generaleViewRes.getString('iva'),
sdi=generaleViewRes.getString('sdi'),
cuu=generaleViewRes.getString('cuu'),
indirizzo=generaleViewRes.getString('indirizzo'),
metodoDiPagamento=generaleViewRes.getString('metodoDiPagamento'),
iban_bic=generaleViewRes.getString('iban_bic')
)
Then I tried encode('utf-8') on each element individually, then encode('utf-8').decode('utf-8'), and finally .decode('utf-8'). The errors were:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd2' in position 57: ordinal not in range(128)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd2' in position 57: ordinal not in range(128)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 57-58: invalid data
Unfortunately this is an error I often hit with the .format method; in smaller contexts I worked around it with the + operator. But large interpolations can't reasonably be built with +, both for readability and for the other conveniences .format provides. I wonder, is it possible that this problem has never been solved?
Thanks in advance.
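No answer is attached here, but the usual Python 2 fix is to make the template a unicode literal and make sure every substituted value is unicode, decoding database bytes once at the boundary. A sketch with a hypothetical value (the helper and the sample bytes are illustrative, not from the question):

```python
# -*- coding: utf-8 -*-

def to_unicode(value, encoding='utf-8'):
    # Bytes coming back from the database get decoded once, here;
    # values that are already unicode pass through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical stand-in for a generaleViewRes.getString(...) result:
# UTF-8 bytes, as a driver might return them.
raw = b'Societ\xc3\xa0 S.r.l.'

template = u'<html><div>{ragioneSociale}</div></html>'  # note the u'' prefix
html = template.format(ragioneSociale=to_unicode(raw))
```

With a u'' template, .format produces unicode throughout, so the implicit ASCII encode that raised the UnicodeEncodeError never happens.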

How to interpret Unicode notation in Python?

How to convert formal Unicode notation like 'U+1F600' into something like this: '\U0001F600', which I saw represented as 'Python Src' at websites online?
My end goal is to use Unicode for emojis in Python (2.x), and I am able to achieve it this way:
unicode_string = '\U0001F600'
unicode_string.decode('unicode-escape')
I would appreciate if you could mention the different character sets involved in the above problem.
The simplest way to do it is to just treat the notation as a string:
>>> s = 'U+1F600'
>>> s[2:] # chop off the U+
'1F600'
>>> s[2:].rjust(8, '0') # pad it to 8 characters with 0s
'0001F600'
>>> r'\U' + s[2:].rjust(8, '0') # prepend the `\U`
'\\U0001F600'
It might be a bit cleaner to parse the string as hex and then format the resulting number back out:
>>> int(s[2:], 16)
128512
>>> n = int(s[2:], 16)
>>> rf'\U{n:08X}'
'\\U0001F600'
… but I'm not sure it's really any easier to understand that way.
If you need to extract these from a larger string, you probably want a regular expression.
We want to match a literal U+ followed by 1 to 8 hex digits, right? So, that's U\+[0-9a-fA-F]{1,8}. Except we really don't need to include the U+ just to pull it off with [2:], so let's group the rest of it: U\+([0-9a-fA-F]{1,8}).
>>> s = 'Hello U+1F600 world'
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s)
<_sre.SRE_Match object; span=(6, 13), match='U+1F600'>
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s).group(1)
'1F600'
Now, we can use re.sub with a function to apply the \U prepending and rjust padding:
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', lambda match: r'\U' + match.group(1).rjust(8, '0'), s)
'Hello \\U0001F600 world'
That's probably more readable if you define the function out-of-line:
>>> def padunimatch(match):
...     return r'\U' + match.group(1).rjust(8, '0')
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
Or, if you prefer to do it numerically:
>>> def padunimatch(match):
...     n = int(match.group(1), 16)
...     return rf'\U{n:08X}'
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
And of course you already know how to do the last part, because it's in your question, right? Well, not quite: you can't call decode on a string, only on a bytes. The simplest way around this is to use the codec directly:
>>> import codecs
>>> x = 'Hello \\U0001F600 world'
>>> codecs.decode(x, 'unicode_escape')
'Hello 😀 world'
… unless you're using Python 2. In that case, the str type isn't a Unicode string, it's a byte-string, so decode actually works fine. But in Python 2, you'll run into other problems, unless all of your text is pure ASCII (with any non-ASCII characters encoded as U+xxxx sequences).
For example, let's say your input was:
>>> s = 'Hej U+1F600 världen'
In Python 3, that's fine. That s is a Unicode string. Under the covers, my console is sending Python UTF-8-encoded bytes to standard input and expecting to get UTF-8-encoded bytes back from standard output, but that just works like magic. (Well, not quite magic—you can print(sys.stdin.encoding, sys.stdout.encoding) to see that Python knows my console is UTF-8 and uses that to decode and encode on my behalf.)
In Python 2, it's not. If my console is UTF-8, what I've actually done is equivalent to:
>>> s = 'Hej U+1F600 v\xc3\xa4rlden'
… and after the re.sub substitution, that's:
>>> s = 'Hej \U0001F600 v\xc3\xa4rlden'
If I try to decode that as unicode-escape, Python 2 will treat those \xc3 and \xa4 bytes as Latin-1 bytes, rather than UTF-8, so what you end up with is:
>>> s.decode('unicode_escape')
u'Hej \U0001f600 v\xc3\xa4rlden'
>>> print(s.decode('unicode_escape'))
Hej 😀 vÃ¤rlden
But what if you try to decode it as UTF-8 first, and then decode that as unicode_escape?
>>> s.decode('utf-8')
u'Hej \\U0001F600 v\xe4rlden'
>>> print(s.decode('utf-8'))
Hej \U0001F600 världen
>>> s.decode('utf-8').decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
Unlike Python 3, which just won't let you call decode on a Unicode string, Python 2 lets you do it—but it handles it by trying to encode to ASCII first, so it has something to decode, and that obviously fails here.
And you can't just use the codec directly, the way you can in Python 3:
>>> codecs.decode(s.decode('utf-8'), 'unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
You could decode the UTF-8, then unicode-escape the result, then un-unicode-escape everything, but even that isn't quite right:
>>> print(s.decode('utf-8').encode('unicode_escape').decode('unicode_escape'))
Hej \U0001F600 världen
Why? Because unicode-escape, while fixing our existing Unicode character, also escaped our backslash!
If you know you definitely have no \U escapes in the original source that you didn't want parsed, there's a quick fix for this: just replace the escaped backslash:
>>> print(s.decode('utf-8').encode('unicode_escape').replace(r'\\U', r'\U').decode('unicode_escape'))
Hej 😀 världen
If this all seems like a huge pain… well, yeah, that's why Python 3 exists, because dealing with Unicode properly in Python 2 (and notice that I didn't even really deal with it properly…) is a huge pain.
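For completeness: in Python 3 the whole pipeline can skip unicode_escape entirely. Parse the hex digits and build the character directly with chr, so there is no backslash round-trip to worry about. A minimal sketch:

```python
import re

def expand_notation(text):
    """Replace each U+XXXX notation with the actual character it names."""
    return re.sub(
        r'U\+([0-9a-fA-F]{1,8})',
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

print(expand_notation('Hej U+1F600 världen'))  # Hej 😀 världen
```

Because no escape sequences are ever materialized as text, literal backslashes elsewhere in the input survive untouched, which is exactly the property the unicode_escape approach loses.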

python + unicodeEncodeError \xb5 while reading from excel and writing to msqldatabase

I have a Python 2.7 script that reads data from an Excel file, where users may enter special characters (e.g. µ), and writes it to a MySQL database.
I've added the following line at the top of the file:
# -*- coding: utf-8 -*-
But it still uses the ASCII codec. How can I solve this error?
This is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 19: ordinal not in range(128)
Thanks in advance.
I faced this issue while inserting values from a third-party app; there I inserted the values as escaped strings:
from re import escape
r = escape('µ')
Result:
'\\\xc2\\\xb5'
Then pass the value of r in the insert statement.
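A more common fix (a sketch, not tied to any particular driver) is to keep the cell value as a unicode object and encode it to UTF-8 explicitly, rather than regex-escaping bytes; with MySQLdb you can also pass charset='utf8' and use_unicode=True to connect() and hand over unicode values directly:

```python
# -*- coding: utf-8 -*-
# Hypothetical cell value; xlrd-style readers return unicode in Python 2.
value = u'\xb5g/mL'  # µg/mL, the character from the error message

# Encoding explicitly to UTF-8 avoids the implicit ASCII encode that raises
# UnicodeEncodeError when the value is passed to the database layer.
encoded = value.encode('utf-8')  # b'\xc2\xb5g/mL'
```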

python print str.decode("utf-8") UnicodeEncodeError

I want to convert a python string (utf-8) to unicode.
word = "3–5" # the dash is u'\u2013' (EN DASH), not an ASCII symbol
print type(word).__name__ # output is str
print repr(word) # output is '3\xe2\x80\x935'
print word.decode("utf-8", errors='ignore')
I got this error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 1: ordinal not in range(128)
But when I change
word.decode("utf-8", errors='ignore')
to
word.decode(errors='ignore')
the error disappears.
Why? word is a UTF-8 string, so why can't I specify utf-8 to decode it?
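No answer is attached here, but the likely explanation (sketched below with Python 3 bytes for clarity) is that the UTF-8 decode itself succeeds; the UnicodeEncodeError comes afterwards, when print tries to encode the resulting u'\u2013' for an ASCII terminal. With word.decode(errors='ignore'), the default ASCII codec silently drops the dash bytes, so the result is pure ASCII and prints without trouble:

```python
word = b'3\xe2\x80\x935'  # the bytes shown by repr(word) in the question

# Decoding as UTF-8 works fine; the en dash survives, errors= or not.
decoded = word.decode('utf-8', errors='ignore')
assert decoded == u'3\u20135'

# The crash happens later, when the result is encoded for an ASCII terminal
# (which is what Python 2's print does implicitly).
try:
    decoded.encode('ascii')
except UnicodeEncodeError:
    pass  # this is the error the question reports

# Decoding with the default ASCII codec and errors='ignore' drops the three
# dash bytes entirely, leaving '35', which any terminal can print.
assert word.decode('ascii', errors='ignore') == u'35'
```

So the error does not disappear because the decode got better; it disappears because the non-ASCII character was thrown away.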

Replace utf8 characters

I want to replace one set of UTF-8 characters with another, but everything I try ends in errors.
I am a noob at Python, so please be patient.
What I want to achieve is converting characters by Unicode values or by HTML entities (more readable, for maintenance).
Tries (with example):
1.First
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

#Found this function
def multiple_replace(dic, text):
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)
text="Larry Wall is ùm© some text"
replace_table = {
    u'\x97' : u'\x82' # ù -> é
}
text2=multiple_replace(replace_table,text)
print text #Expected:Larry Wall is ém© some text
#Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
#Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
2.Html entities
dic = {
    "ú" : "é" # ù -> é
}
some_text="Larry Wall is ùm© some text"
some_text2=some_text.encode('ascii', 'xmlcharrefreplace')
some_text2=multiple_replace(dic,some_text2)
print some_text2
#Got:UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
Any ideas are welcome
Your problem is due to the fact that your input strings are non-Unicode byte strings (<type 'str'> rather than <type 'unicode'>). You must define the input string using the u"..." syntax:
text=u"Larry Wall is ùm© some text"
# ^
(Besides, you will have to fix the last statement in your first example: currently it prints the input string (text), whereas I am pretty sure you meant to see the result (text2).)
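For reference, here is the first example with that fix applied (u'' literals throughout, the table passed under a consistent name, and the replacement done by character so no byte-value guessing is needed). This is a sketch of the corrected approach, not the asker's exact byte values:

```python
# -*- coding: utf-8 -*-
import re

def multiple_replace(dic, text):
    # Build one alternation pattern from all keys and substitute via a lookup.
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)

text = u"Larry Wall is ùm© some text"
replace_table = {
    u'ù': u'é',  # replace by character, on unicode strings throughout
}
text2 = multiple_replace(replace_table, text)
print(text2)  # Larry Wall is ém© some text
```

Because both the text and the table entries are unicode, the UnicodeWarning about mixed str/unicode comparison never arises.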