ASCII Control characters: \x0e - \x1f - python-2.7

I want to convert the \x0e and \x0f characters to their equivalent keyboard text.
Is Python able to encode/decode the ASCII control characters (\x0e - \x1f) to keyboard text?

This is not encoded (well, not in ASCII, anyway); this is how the text is supposed to look. An ASCII escape looks something like \55 (which is a "-", since \55 is octal for 45) and nothing like what you have given above.
Proof of this can be found if we run your characters and \55 through this simple program I built:
text = "\x0e \x0f \55" # What we want to try goes here
new_text = text.encode('ascii') # What we want to encode it in
print new_text # Print the outcome
The outcome is:
\x0e \x0f -
This shows that \55 has been converted, whereas \x0e and \x0f remain the same because they are not encoded in ASCII.
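If by "keyboard text" you mean caret notation (\x0e is Ctrl-N, usually written ^N), the mapping is just an offset of 0x40. A minimal sketch of that idea (the helper name to_caret is my own, not a standard function):

```python
def to_caret(ch):
    # Map a C0 control character (0x00-0x1f) to caret notation, e.g. '\x0e' -> '^N'.
    code = ord(ch)
    if code < 0x20:
        # Adding 0x40 gives the letter you hold Ctrl with to type the character.
        return '^' + chr(code + 0x40)
    return ch  # printable characters pass through unchanged

print(to_caret('\x0e'))  # ^N (Shift Out, i.e. Ctrl-N)
print(to_caret('\x1f'))  # ^_
```

This works the same in Python 2 and 3, since it only uses ord() and chr() on single characters.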

Related

Python 2.7: Printing UTF-8 symbols as list fails, but iterating through list is successful

Currently I'm writing a script that searches a .txt file for any measurement in micrometers. These text documents commonly use the mu symbol "µ" which is where the fun begins.
p = re.compile('\d+\.\d+\s?\-?[uUµ][mM]')
file = open("text_to_be_searched.txt").read()
file = file.decode("utf-8")
match = re.findall(p, file)
if match == []:
    print "No matches found"
else:
    for i in range(len(match)):
        match[i] = match[i].replace("\n", "") # cleans up line breaks
        print match[i] # prints correctly
    print match # prints incorrectly
In the above code, iterating through the list prints the values nicely to the console.
1.06 µm
10.6 µm
3.8 µm
However, if I try to print the list, it displays them incorrectly.
[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']
Why does the print command display the iterated values correctly but the entire list incorrectly?
EDIT:
Thanks to @BoarGules, and others.
I found that
match[i] = match[i].replace("µ", "u")
returned errors:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 5: ordinal not in range(128)
Python is mad that the Unicode symbol isn't within the original 128 ASCII characters, as explained on Joel on Software.
But by simply telling it that the symbol was unicode:
match[i] = match[i].replace(u"µ", "u")
We get a more readable result.
[u'1.06 um', u'10.6 um', u'3.8 um']
It's a step in the right direction at least.
This is not really incorrect:
[u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']
It is the way you would have to type them manually into your program. If you tried to do this:
['1.06 µm','10.6 µm','3.8 µm']
you would get a source encoding error (unless you put an encoding comment at the top of your program).
It is just a different representation of the same data. Recall that a list is a data structure. You can't actually print it as it is in memory because that is just a bunch of bytes. It has to be interpreted into something resembling program code, in other words, turned into a string, to be printed. The interpreter does a generic job. It has to display the difference between normal str-type strings and unicode strings (hence the u"...") and it has to escape characters outside of the ascii character set. If it didn't do that it would be much less useful.
If you have fixed ideas about how the list should be displayed then you need to format it yourself for output.
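For example, to get the list printed the way the individual items print, join it into a single unicode string yourself (a minimal sketch using the matches from the question):

```python
match = [u'1.06 \xb5m', u'10.6 \xb5m', u'3.8 \xb5m']
# Joining produces one unicode string, so print shows the actual characters
# instead of the list's escaped repr.
print(u", ".join(match))  # 1.06 µm, 10.6 µm, 3.8 µm
```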

Issues seen during utf-7 conversion

Python : 2.7
I need to convert to UTF-7 before I go ahead, so I have used the code below in the Python 2.7 interpreter:
>>> mbox = u'한국의'
>>> mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
'&1VytbcdY-'
When I use the same code in my Python script, as shown below, the output for mbox is b'&Ti1W,XaE' instead of b'&Ti1W,XaE-', i.e. the "-" at the end of the string is missing when running as a script instead of in the interpreter.
mbox = "b'" + mbox + "'"
print mbox
mbox = mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
print mbox
Please suggest.
Quoting from Wikipedia's description of UTF-7:
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates), big-endian (hence higher-order bits appear first), and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
Any block of encoded characters must end with a non-Base64 character. If the string includes such a character, it will be used, otherwise - is added to the end of the block. Your first example includes a - for this reason. Your second example doesn't need one because ' is not part of the Base64 character set.
If your intent is to create a Python literal that creates a valid UTF-7 string, just do things in a different order.
mbox = b"b'" + mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",") + b"'"
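A quick sketch showing both behaviors side by side, using the Korean string from the interpreter example; whether the trailing - appears depends only on what follows the Base64 block:

```python
# -*- coding: utf-8 -*-
# End of string: the encoder must terminate the Base64 block, so it appends '-'.
print(u'한국의'.encode('utf-7'))    # b'+1VytbcdY-'

# Followed by a quote: ' is not in the modified Base64 set, so it ends the
# block by itself and no '-' is emitted.
print(u"b'한국의'".encode('utf-7'))  # b"b'+1VytbcdY'"
```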

Python: ascii codec can't encode en-dash

I'm trying to print a poem from the Poetry Foundation's daily poem RSS feed with a thermal printer that supports an encoding of CP437. This means I need to translate some characters; in this case an en-dash to a hyphen. But python won't even encode the en dash to begin with. When I try to decode the string and replace the en-dash with a hyphen I get the following error:
Traceback (most recent call last):
File "pftest.py", line 46, in <module>
str = str.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 140: ordinal not in range(128)
And here is my code:
#!/usr/bin/python
#-*- coding: utf-8 -*-
# This string is actually a variable entitled d['entries'][1].summary_detail.value
str = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""
str = str.decode('utf-8')
str = str.replace("\u2013", "-") #en dash
str = str.replace("\u2014", "--") #em dash
print (str)
I can actually print the output using the following code without errors in my terminal window (Mac), but my printer spits out sets of 3 CP437 characters:
str = u''.str.encode('utf-8')
I'm using Sublime Text as my editor, and I've saved the page with UTF-8 encoding, but I'm not sure that will help things. I would greatly appreciate any help with this code. Thank you!
I don't fully understand what's happening in your code, but I've also been trying to replace en-dashes with hyphens in a string I got from the Web, and here's what's working for me. My code is just this:
txt = re.sub(u"\u2013", "-", txt)
I'm using Python 2.7 and Sublime Text 2, but I don't bother setting -*- coding: utf-8 -*- in my script, as I'm trying not to introduce any new encoding issues. (Even though my variables may contain Unicode I like to keep my code pure ASCII.) Do you need to include Unicode in your .py file, or was that just to help with debugging?
I'll note that my txt variable is already a unicode string, i.e.
print type(txt)
produces
<type 'unicode'>
I'd be curious to know what type(str) would produce in your case.
One thing I noticed in your code is
str = str.replace("\u2013", "-") #en dash
Are you sure that does anything? My understanding is that \u only means "unicode character" inside a u"" string, and what you've created there is a six-character string: a backslash, a "u", a "2", a "0", and so on. (Python leaves unrecognized escape sequences in the string as-is, backslash included, so "\u2013" in a plain str is just those six characters.)
Also, the fact that you get 3 CP437 characters from your printer makes me suspect that you still have an en-dash in your string. The UTF-8 encoding of an en-dash is 3 bytes: 0xe2 0x80 0x93. When you call str.encode('utf-8') on a unicode string that contains an en-dash you get those three bytes in the returned string. I'm guessing that your terminal knows how to interpret that as an en-dash and that's what you're seeing.
If you can't get my first method to work, I'll mention that I also had success with this:
txt = txt.encode('utf-8')
txt = re.sub("\xe2\x80\x93", "-", txt)
Maybe that re.sub() would work for you if you put it after your call to encode(). And in that case you might not even need that call to decode() at all. I'll confess that I really don't understand why it's there.
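Putting it together, the fix for the code in the question is just to make the replacement targets unicode literals too (a sketch using one line of the poem):

```python
poem = u"But eighteen years of age \u2013 who sweetly smiled"
# With u"" literals on both sides, \u2013 is the actual en-dash character,
# so the replacement really happens.
poem = poem.replace(u"\u2013", u"-")   # en dash -> hyphen
poem = poem.replace(u"\u2014", u"--")  # em dash -> double hyphen
print(poem)  # But eighteen years of age - who sweetly smiled
```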

How to use Regex to strip punctuation without tainting UTF-8 or UTF-16 encoded text like chinese?

How do I strip punctuation from ASCII and UTF-8 encoded strings without messing up the original UTF-8 characters, specifically Chinese, in R?
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')
results in:
Longchamp Le Pliage ��背�� 小
but the desired result should be:
Longchamp Le Pliage 肩背包 小
I'm looking to remove all the CJK Symbols and Punctuation as well as the ASCII punctuation.
@akrun, sessionInfo() is as follows:
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252 LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C LC_TIME=English_Singapore.1252
Display of Chinese characters (hanzi) works variably depending on platform and IDE (see this answer for lots of details about R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, but that some of the hanzi are being displayed wrong (even if their underlying codepoints are correct). Try this:
library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)
If you can get the text to display on a plot, then underlyingly the string is properly encoded and the problem is just how it displays in the R terminal. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)
to see if the result of stri_replace_all_regex is in fact doing what you expect.

Python hex variable assignment

I'm using a variable to store data that gets sent by a socket. When I assign it in my program it works, but when I read it from a file it is treated as a literal string.
Example:
data = '\x31\x32\x33'
print data
Outputs
123 # <--- this is the result I want when I read from a file to assign data
f = open('datafile') # <--- datafile contains \x31\x32\x33 on one line
data = f.readline()
print data
Outputs
\x31\x32\x33 # <--- wanted it to print 123, not \x31\x32\x33.
In Python the string '\x31\x32\x33' is actually only three characters '\x31' is the character with ordinal 0x31 (49), so '\x31' is equivalent to '1'. It sounds like your file actually contains the 12 characters \x31\x32\x33, which is equivalent to the Python string '\\x31\\x32\\x33', where the escaped backslashes represent a single backslash character (this can also be represented with the raw string literal r'\x31\x32\x33').
If you really are sure that this data should be '123', then you need to look at how that file is being written. If that is something you can control then you should address it there so that you don't end up with data consisting of several bytes representing hex escapes.
It is also possible that whatever is writing this data is already using some data-interchange format (similar to JSON), in which case you don't need to change how it is written, you just need to use a decoder for that data-interchange format (like json.loads(), but this isn't JSON).
If somehow neither of the above are really what you want, and you just want to figure out how to convert a string like r'\x31\x32\x33' to '123' in Python, here is how you can do that:
>>> r'\x31\x32\x33'.decode('string_escape')
'123'
Or in Python 3.x:
>>> br'\x31\x32\x33'.decode('unicode_escape')
'123'
edit: Based on comments it looks like you are actually getting hex strings like '313233', to convert a string like that to '123' you can decode using hex:
>>> '313233'.decode('hex')
'123'
Or on Python 3.x:
>>> bytes.fromhex('313233').decode('utf-8')
'123'
I might have violated many programming standards here, but the following code works for the given situation:
with open('datafile') as f:
    data = f.read()
data = data.lstrip('\\x') # strips the leftmost '\x' so that 'data' now contains numbers separated by '\x'
data = data.strip().split('\\x') # now data contains a list of hex numbers
s = ''
for d in data:
    s += chr(int(d, 16)) # converts each hex ASCII value to its character and concatenates onto 's'
print s
As stated, you are doing post-processing; it would be easier to handle if the text were "313233".
you would then be able to use
data = "313233"
print data.decode("hex") # this will print '123'
As stated in the comments, this is for Python 2.7 and was removed in 3.x. However, unless this question is mis-tagged, this will work.
Yes, when you do a conversion from a string to an int, you can specify the base of the numbers in the string:
>>> print int("31", 16)
49
>>> chr(49)
'1'
So you should be able to just parse the hex values out of your file and individually convert them to chars.
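For instance, you can split the hex string into two-character chunks and convert each pair (a sketch; this works the same in Python 2 and 3 for ASCII values):

```python
s = '313233'
# Take the string two hex digits at a time, parse each pair as base 16,
# and map the resulting ordinal back to its character.
result = ''.join(chr(int(s[i:i + 2], 16)) for i in range(0, len(s), 2))
print(result)  # 123
```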