Python: ascii codec can't encode en-dash - python-2.7

I'm trying to print a poem from the Poetry Foundation's daily poem RSS feed on a thermal printer that supports the CP437 encoding. This means I need to translate some characters, in this case an en-dash to a hyphen. But Python won't even get me as far as the en-dash. When I try to decode the string so I can replace the en-dash with a hyphen, I get the following error:
Traceback (most recent call last):
  File "pftest.py", line 46, in <module>
    str = str.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 140: ordinal not in range(128)
And here is my code:
#!/usr/bin/python
#-*- coding: utf-8 -*-
# This string is actually a variable entitled d['entries'][1].summary_detail.value
str = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""
str = str.decode('utf-8')
str = str.replace("\u2013", "-") #en dash
str = str.replace("\u2014", "--") #em dash
print (str)
I can actually print the output using the following code without errors in my terminal window (Mac), but my printer spits out sets of 3 CP437 characters:
str = u''.str.encode('utf-8')
I'm using Sublime Text as my editor, and I've saved the page with UTF-8 encoding, but I'm not sure that will help things. I would greatly appreciate any help with this code. Thank you!

I don't fully understand what's happening in your code, but I've also been trying to replace en-dashes with hyphens in a string I got from the Web, and here's what's working for me. My code is just this:
txt = re.sub(u"\u2013", "-", txt)
I'm using Python 2.7 and Sublime Text 2, but I don't bother setting -*- coding: utf-8 -*- in my script, as I'm trying not to introduce any new encoding issues. (Even though my variables may contain Unicode I like to keep my code pure ASCII.) Do you need to include Unicode in your .py file, or was that just to help with debugging?
I'll note that my txt variable is already a unicode string, i.e.
print type(txt)
produces
<type 'unicode'>
I'd be curious to know what type(str) would produce in your case.
One thing I noticed in your code is
str = str.replace("\u2013", "-") #en dash
Are you sure that does anything? My understanding is that \u only means "unicode character" inside a u"" string, and what you've created there is a six-character string: a backslash, a "u", a "2", a "0", and so on. (The backslash stays because you can escape any character, and when the escape has no special meaning, unlike '\n' or '\t', Python just keeps the backslash as a literal character.)
Also, the fact that you get 3 CP437 characters from your printer makes me suspect that you still have an en-dash in your string. The UTF-8 encoding of an en-dash is 3 bytes: 0xe2 0x80 0x93. When you call str.encode('utf-8') on a unicode string that contains an en-dash you get those three bytes in the returned string. I'm guessing that your terminal knows how to interpret that as an en-dash and that's what you're seeing.
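For reference, you can verify those bytes in the interpreter:
>>> u'\u2013'.encode('utf-8')
'\xe2\x80\x93'
>>> len(u'\u2013'.encode('utf-8'))
3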
If you can't get my first method to work, I'll mention that I also had success with this:
txt = txt.encode('utf-8')
txt = re.sub("\xe2\x80\x93", "-", txt)
Maybe that re.sub() would work for you if you put it after your call to encode(). And in that case you might not even need that call to decode() at all. In fact, the traceback hints at why decode() blows up: in Python 2, calling .decode('utf-8') on a string that is already unicode first encodes it with the default ASCII codec, which is exactly how a decode() call can raise a UnicodeEncodeError. The value you pull out of the feed is probably already unicode.
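In case it helps, here is a minimal sketch of the whole round trip as I understand it. The CP437 target and the dash replacements come from your question; the variable names, the isinstance() guard and the 'replace' error handler are my own assumptions:
#!/usr/bin/python
# -*- coding: utf-8 -*-
raw = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""

# Feed libraries often hand you unicode already, so only decode bytes.
if isinstance(raw, str):
    text = raw.decode('utf-8')
else:
    text = raw

text = text.replace(u'\u2013', u'-')    # en dash -> hyphen
text = text.replace(u'\u2014', u'--')   # em dash -> double hyphen

# Encode for the printer; anything CP437 can't represent becomes '?'.
printer_bytes = text.encode('cp437', 'replace')
print printer_bytes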

Related

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a unicode string as input, e.g. 'some👩😌thing', and removes all emojis, leaving 'something'. Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in the process I came across some strange behavior, which I demonstrate below. I believe that if the code below is fixed, I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In narrow builds of Python 2.7 (the default on Windows and macOS), Unicode codepoints at or above 0x10000 are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 as a wide build for this, and recent versions of Python 3 do it automatically.
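For example, on a narrow build:
>>> s = u'\U0001F469'
>>> len(s)
2
>>> s[0], s[1]
(u'\ud83d', u'\udc69')
On a wide build (or any modern Python 3), len(u'\U0001F469') is 1.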
To create a regular expression to use for the replace, simply join all the characters together with |. Since the characters in the list are already encoded as surrogate pairs, this creates the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 narrow builds use word-based (16-bit) Unicode storage, in which codepoints above 0xFFFF are automatically substituted by surrogate pairs.
Before the regex "sees" your Python string, Python has already helpfully split your large Unicode codepoints into two separate surrogate characters, each valid on its own but meaningless in isolation.
That means that ur'[\U0001f469]+' is really a character class containing the two surrogate halves. It strips both halves of 👩, but it also strips the first half of 😌 (the two emoji share the same high surrogate), leaving a lone low surrogate behind. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩" didn't work before, but with a group it does:
print re.sub(ur'(\U0001f469)+', u'', text)
some😌thing
because now the regex engine sees the exact same sequence of characters, surrogate pairs or otherwise, that you are looking for: a group matches the two surrogates in order, where a character class matched either one on its own.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a "nothing to repeat" error, because literal regex metacharacters in the keys clash with the alternation inside the group, so map(re.escape, exclude_list) is required.
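To see why, note that some keys in UNICODE_EMOJI are keycap sequences that start with a literal * or # (the exact keys depend on your emoji version), and an unescaped * at the start of an alternative is a dangling quantifier:
>>> import re
>>> re.compile(u'(?:' + u'*\ufe0f\u20e3' + u')+')   # keycap asterisk, unescaped
Traceback (most recent call last):
  ...
sre_constants.error: nothing to repeat
>>> re.compile(u'(?:' + re.escape(u'*\ufe0f\u20e3') + u')+')  # compiles fine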
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

Python 2.7: Printing UTF-8 symbols as list fails, but iterating through list is successful

Currently I'm writing a script that searches a .txt file for any measurement in micrometers. These text documents commonly use the mu symbol "µ" which is where the fun begins.
p = re.compile('\d+\.\d+\s?\-?[uUµ][mM]')
file = open("text_to_be_searched.txt").read()
file = file.decode("utf-8")
match = re.findall(p, file)
if match == []:
    print "No matches found"
else:
    for i in range(len(match)):
        match[i] = match[i].replace("\n", "") # cleans up line breaks
        print match[i] # prints correctly
    print match # prints incorrectly
In the above code, iterating through the list prints the values nicely to the console.
1.06 µm
10.6 µm
3.8 µm
However, if I try to print the list, it displays them incorrectly.
[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']
Why does the print command display the iterated values correctly but the entire list incorrectly?
EDIT:
Thanks to @BoarGules and others.
I found that
match[i] = match[i].replace("µ", "u")
returned errors:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 5: ordinal not in range(128)
Python is complaining that the Unicode symbol isn't within the original 128 ASCII characters, as explained on Joel on Software.
But by simply telling it that the symbol was unicode:
match[i] = match[i].replace(u"µ", "u")
We get a more readable result.
[u'1.06 um', u'10.6 um', u'3.8 um']
It's a step in the right direction at least.
This is not really incorrect:
[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']
It is the way you would have to type them manually into your program. If you tried to do this:
['1.06 µm','10.6 µm','3.8 µm']
you would get a source encoding error (unless you put an encoding comment at the top of your program).
It is just a different representation of the same data. Recall that a list is a data structure. You can't actually print it as it is in memory, because that is just a bunch of bytes; it has to be turned into a string to be printed, and the interpreter turns it into something resembling program code. The interpreter does a generic job: it has to display the difference between normal str-type strings and unicode strings (hence the u"..."), and it has to escape characters outside of the ASCII character set. If it didn't do that it would be much less useful.
If you have fixed ideas about how the list should be displayed then you need to format it yourself for output.
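For example (assuming your console encoding can display µ):
print u', '.join(match)            # 1.06 µm, 10.6 µm, 3.8 µm
print u'[%s]' % u', '.join(match)  # [1.06 µm, 10.6 µm, 3.8 µm]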

Issues seen during utf-7 conversion

Python : 2.7
I need to convert to UTF-7 before I go ahead, so I have used the code below in the Python 2.7 interpreter:
>>> mbox = u'한국의'
>>> mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
'&1VytbcdY-'
When I use the same code in my Python script, as shown below, the output for mbox is b'&Ti1W,XaE' instead of b'&Ti1W,XaE-', i.e. the "-" at the end of the string is missing when running as a script instead of in the interpreter.
mbox = "b'" + mbox + "'"
print mbox
mbox = mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
print mbox
Please suggest.
Quoting from Wikipedia's description of UTF-7:
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates), big-endian (hence higher-order bits appear first), and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
Any block of encoded characters must end with a non-Base64 character. If the string includes such a character, it will be used, otherwise - is added to the end of the block. Your first example includes a - for this reason. Your second example doesn't need one because ' is not part of the Base64 character set.
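You can see the rule at work in the interpreter (the Base64 payload matches the question's own output):
>>> u'한국의'.encode('utf-7')
'+1VytbcdY-'
>>> (u'한국의' + u"'").encode('utf-7')
"+1VytbcdY'"
In the first case the string ends right after the Base64 block, so a terminating - is added; in the second, the apostrophe ends the block by itself.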
If your intent is to create a Python literal that creates a valid UTF-7 string, just do things in a different order.
mbox = b"b'" + mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",") + b"'"

Regex search with variable in Python 2.7 returns bytes instead of decoded text

The words of the "wordslist" and the text I'm searching are in Cyrillic. The text is encoded in UTF-8 (as set in Notepad++). I need Python to match a word in the text and grab everything after the word up to a full stop followed by a newline.
EDIT
with open('C:\....txt', 'rb') as f:
    wordslist = []
    for line in f:
        wordslist.append(line)
wordslist = map(str.strip, wordslist)
/EDIT
for i in wordslist:
    print i # so far, so good, I get Cyrillic
    wantedtext = re.findall(i + ".*\.\r\n", open('C:\....txt', 'rb').read())
    wantedtext = str(wantedtext)
    print wantedtext
"Wantedtext" shows and saves as "\xd0\xb2" (etc.).
What I tried:
This question is different from Convert bytes to a python string, because there is a variable involved here. Also, the solution from the chosen answer there,
wantedtext.decode('utf-8')
didn't work; the result was the same. The solution from here didn't help either.
EDIT: Revised code, returning "[]".
with io.open('C:....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()
for i in wordslist:
    print i
    with io.open('C:....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
    print my_file_test # works, prints cyrillic characters, but...
    wantedtext = re.findall(i + ".*\.\r\n", my_file_test)
    wantedtext = str(wantedtext)
    print wantedtext # returns []
(Added after a comment below: This code works if you erase \r from the regular expression.)
Python 2.x only
Your find is probably not working because you're mixing strs and unicode strs, or strs containing different encodings. If you don't know the difference between unicode str and str, see: https://stackoverflow.com/a/35444608/1554386
Don't start decoding stuff unless you know what you're doing. It's not voodoo :)
You need to get all your text into Unicode objects first.
Split your read into a separate line - it's easier to read
Decode your text file. Use io.open(), which supports Python 3-style decoding. I'm going to assume your text file is UTF-8 (we'll soon find out if it's not):
with io.open('C:\....txt', 'r', encoding='utf-8') as my_file:
    my_file_test = my_file.read()
my_file_test is now a Unicode str
Now you can do:
# finds lines beginning with i, ending in .
regex = u'^{i}.*?\.$'.format(i=i)
wantedtext = re.findall(regex, my_file_test, re.M)
Look at wordslist. You don't say what you do with it, but you need to make sure it's a unicode str too. If you read it from a file, use the same io.open() as above.
Edit:
For wordslist, you can decode and read the file into a list while removing line feeds in one go:
with io.open('C:\....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()
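Putting it together, a minimal end-to-end sketch under the same assumptions (both files UTF-8; the file names are placeholders, and re.escape() is my addition to guard against regex metacharacters in the words):
import io
import re

with io.open('words.txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()

with io.open('text.txt', 'r', encoding='utf-8') as f:
    my_file_test = f.read()

for i in wordslist:
    # lines that start with the word and run up to the first full stop
    regex = u'^{i}.*?\.$'.format(i=re.escape(i))
    for wanted in re.findall(regex, my_file_test, re.M):
        print wanted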

Replace utf8 characters

I want to replace one set of UTF-8 characters with another UTF-8 character set, but everything I try ends in errors.
I am a noob at Python, so please be patient.
What I want to achieve is converting characters by Unicode values or by HTML entities (more readable, for maintenance).
Attempts (with examples):
1. First:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Found this function
def multiple_replace(dic, text):
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)

text = "Larry Wall is ùm© some text"
replace_table = {
    u'\x97' : u'\x82' # ù -> é
}
text2 = multiple_replace(replace_table, text)
print text # Expected: Larry Wall is ém© some text
# Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
2. HTML entities:
dic = {
    "ú" : "é" # ù -> é
}
some_text = "Larry Wall is ùm© some text"
some_text2 = some_text.encode('ascii', 'xmlcharrefreplace')
some_text2 = multiple_replace(dic, some_text2)
print some_text2
# Got: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
Any ideas are welcome
Your problem is due to the fact that your input strings are in non-unicode representation (<type 'str'> rather than <type 'unicode'>). You must define the input string using the u"..." syntax:
text=u"Larry Wall is ùm© some text"
# ^
(Besides, you will have to fix the last statement in your first example: currently it prints the input string (text), whereas I am pretty sure that you meant to see the result (text2).)
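To make the first attempt work end to end, here is a minimal sketch with everything as unicode (the ù -> é pair and its codepoints are illustrative):
# -*- coding: utf-8 -*-
import re

def multiple_replace(dic, text):
    # keys, values and text are all unicode, so no implicit ASCII coercion happens
    pattern = u"|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)

replace_table = {
    u'\u00f9' : u'\u00e9', # ù -> é
}
text = u"Larry Wall is ùm© some text"
text2 = multiple_replace(replace_table, text)
print text2 # Larry Wall is ém© some text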