Issues seen during utf-7 conversion

Issues seen during utf-7 conversion - python-2.7

Python : 2.7
I need to convert to utf-7 before I go ahead, so I have used below code in python 2.7 interpreter:
>>> mbox = u'한국의'
>>> mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
'&1VytbcdY-'
Same code when I use in my python script as shown below, the output for mbox is b'&Ti1W,XaE' instead of b'&Ti1W,XaE-' i.e. "-" at end of string is missing when running as a script instead of interpreter.
mbox = "b'" + mbox + "'"
print mbox
mbox = mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
print mbox
Please suggest.

Quoting from Wikipedia's description of UTF-7:
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates), big-endian (hence higher-order bits appear first), and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
Any block of encoded characters must end with a non-Base64 character. If the string includes such a character, it will be used, otherwise - is added to the end of the block. Your first example includes a - for this reason. Your second example doesn't need one because ' is not part of the Base64 character set.
If your intent is to create a Python literal that creates a valid UTF-7 string, just do things in a different order.
mbox = b"b'" + mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",") + b"'"

Related

Convert single Unicode character to ASCII character

I got a unicode e.g. "00C4" saved in an array. I want to replace a placeholder e.g. "\A25" in a text with the ascii value of an unicode from the array which only has the unicode value. I tried everything from encoding, decoding, raw strings, unicode strings and different setups with the escape symbol "\". The issue here is that i can not write the clear '\u1234' in the code, I have to use the array values and combine it with something like '\u'. This is my current code:
e.g. prototypeArray[i][1] = 00C4
e.g. prototypeArray[i][0] = A25
unicodeChar = u'\\u' + prototypeArray[i][1]
placeholder = '\\' + prototypeArray[i][0]
placeholder = u'' + placeholder
text = text.replace(placeholder,s)
Currently it is only replacing e.g. \A25 with \u00C4 in the text. The unicode character is not interpreted as such.

UTF-8 specific interpretation:
I assume you have the unicode point represented in hexadecimal in UTF-8 stored as a string in a variable (c). And you want to determine the corresponding character. Then the following code snippet shows how to do it:
>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ą'
Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify basically takes two characters each, interprets them as hexadecimal numbers and make them one byte. All those bytes are merged to a bytes array. Finally str.decode('utf-8') interprets those bytes as UTF-8 encoded data and returns it as string.
>>> cp2chr('00C4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data
Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has bit structure 11000100, is therefore a continuation byte and requires another character afterwards.
Encoding independent interpretation:
So you might be looking for interpretation of unicode points independent of the encoding. Then you are looking for the raw_unicode_escape encoding:
>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape')
>>> cp2chr('00C4')
'Ä'
Explanation: raw_unicode_escape convert the unicode escape sequences given in a byte string and returns it as string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is what python does internally if you write \uSOMETHING in your source code.

Python 2.7, convert utf8 string to ascii

I am working with python 2.7.12
I have string which contains a unicode literal, which is not of type Unicode. I would like to convert this to text. This example explains what I am trying to do.
>>> s
'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
>>> print s
username
>>> type(s)
<type 'str'>
>>> s == "username"
False
How would I go about converting this string?

That's not UTF-8, it's UTF-16, though it's unclear whether it's big endian or little endian (you have no BOM, and you have a leading and trailing NUL byte, making it an uneven length). For text in the ASCII range, UTF-8 is indistinguishable from ASCII, while UTF-16 alternates NUL bytes with the ASCII encoded bytes (as in your example).
In any event, converting to plain ASCII is fairly easy, you just need to deal with the uneven length one way or another:
s = 'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00' # I removed \x00 from beginning manually
sascii = s.decode('utf-16-le').encode('ascii')
# Or without manually removing leading \x00
sascii = s.decode('utf-16-be', errors='ignore').encode('ascii')
Course, if your inputs are just NUL interspersed ASCII and you can't figure out the endianness or how to get an even number of bytes, you can just cheat:
sascii = s.replace('\x00', '')
But that won't raise exceptions in the case where the input is some completely different encoding, so it may hide errors that specifying what you expect would have caught.

ASCII Control characters: \x0e - \x1f

I want to convert the \x0e and \x0f characters to equivalent keyboard text.
Does python able to encode/decode the ASCII control characters(\x0e - \x1f) to keyboard text.

This is not encoded (well not in ascii anyway). This is how the text is supposed to look.
ascii is encoded something like \55 (which is a "-") and nothing like you have given above.
Proof of this can be found if we run your commands and then \55 through this simple program I built:
text = "\x0e \x0f \55" # What we want to try goes here
new_text = text.encode('ascii') # What we want to encode it in
print new_text # Print the outcome
The outcome is:
\x0e \x0f -
This shows that \55 has been converted and so has \x0e \x0f however they remain the same because they are not encoded in ascii.

Python: ascii codec can't encode en-dash

I'm trying to print a poem from the Poetry Foundation's daily poem RSS feed with a thermal printer that supports an encoding of CP437. This means I need to translate some characters; in this case an en-dash to a hyphen. But python won't even encode the en dash to begin with. When I try to decode the string and replace the en-dash with a hyphen I get the following error:
Traceback (most recent call last):
File "pftest.py", line 46, in <module>
str = str.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 140: ordinal not in range(128)
And here is my code:
#!/usr/bin/python
#-*- coding: utf-8 -*-
# This string is actually a variable entitled d['entries'][1].summary_detail.value
str = """Love brought by night a vision to my bed,
One that still wore the vesture of a child
But eighteen years of age – who sweetly smiled"""
str = str.decode('utf-8')
str = str.replace("\u2013", "-") #en dash
str = str.replace("\u2014", "--") #em dash
print (str)
I can actually print the output using the following code without errors in my terminal window (Mac), but my printer spits out sets of 3 CP437 characters:
str = u''.str.encode('utf-8')
I'm using Sublime Text as my editor, and I've saved the page with UTF-8 encoding, but I'm not sure that will help things. I would greatly appreciate any help with this code. Thank you!

I don't fully understand what's happening in your code, but I've also been trying to replace en-dashes with hyphens in a string I got from the Web, and here's what's working for me. My code is just this:
txt = re.sub(u"\u2013", "-", txt)
I'm using Python 2.7 and Sublime Text 2, but I don't bother setting -*- coding: utf-8 -*- in my script, as I'm trying not to introduce any new encoding issues. (Even though my variables may contain Unicode I like to keep my code pure ASCII.) Do you need to include Unicode in your .py file, or was that just to help with debugging?
I'll note that my txt variable is already a unicode string, i.e.
print type(txt)
produces
<type 'unicode'>
I'd be curious to know what type(str) would produce in your case.
One thing I noticed in your code is
str = str.replace("\u2013", "-") #en dash
Are you sure that does anything? My understanding is that \u only means "unicode character' inside a u"" string, and what you've created there is a string with 5 characters, a "u", a "2", a "0", etc. (The first character is because you can escape any character and if there's no special meaning, like in the case of '\n' or '\t', it just ignores the backslash.)
Also, the fact that you get 3 CP437 characters from your printer makes me suspect that you still have an en-dash in your string. The UTF-8 encoding of an en-dash is 3 bytes: 0xe2 0x80 0x93. When you call str.encode('utf-8') on a unicode string that contains an en-dash you get those three bytes in the returned string. I'm guessing that your terminal knows how to interpret that as an en-dash and that's what you're seeing.
If you can't get my first method to work, I'll mention that I also had success with this:
txt = txt.encode('utf-8')
txt = re.sub("\xe2\x80\x93", "-", txt)
Maybe that re.sub() would work for you if you put it after your call to encode(). And in that case you might not even need that call to decode() at all. I'll confess that I really don't understand why it's there.

Python hex variable assignment

I'm using a variable to store data that gets sent by a socket. When I assign it in my program it works but when I read it from a file it is treated as a string.
Example:
data = '\x31\x32\x33'
print data
Outputs
123 # <--- this is the result I want when I read from a file to assign data
f = open('datafile') <--- datafile contains \x31\x32\x33 on one line
data = f.readline()
print data
Outputs
\x31\x32\x33 # <--- wanted it to print 123, not \x31\x32\x33.

In Python the string '\x31\x32\x33' is actually only three characters '\x31' is the character with ordinal 0x31 (49), so '\x31' is equivalent to '1'. It sounds like your file actually contains the 12 characters \x31\x32\x33, which is equivalent to the Python string '\\x31\\x32\\x33', where the escaped backslashes represent a single backslash character (this can also be represented with the raw string literal r'\x31\x32\x33').
If you really are sure that this data should be '123', then you need to look at how that file is being written. If that is something you can control then you should address it there so that you don't end up with data consisting of several bytes representing hex escapes.
It is also possible that whatever is writing this data is already using some data-interchange format (similar to JSON), in which case you don't need to change how it is written, you just need to use a decoder for that data-interchange format (like json.loads(), but this isn't JSON).
If somehow neither of the above are really what you want, and you just want to figure out how to convert a string like r'\x31\x32\x33' to '123' in Python, here is how you can do that:
>>> r'\x31\x32\x33'.decode('string_escape')
'123'
Or in Python 3.x:
>>> br'\x31\x32\x33'.decode('unicode_escape')
'123'
edit: Based on comments it looks like you are actually getting hex strings like '313233', to convert a string like that to '123' you can decode using hex:
>>> '313233'.decode('hex')
'123'
Or on Python 3.x:
>>> bytes.fromhex('313233').decode('utf-8')
'123'

I might have violated many programming standards here, but the following code works for the given situation
with open('datafile') as f:
data=f.read()
data=data.lstrip('\\x') #strips the leftmost '\x' so that now string 'data' contains numbers saperated by '\x'
data=data.strip().split('\\x') #now data contains list of numbers
s=''
for d in data:
s+=chr(int(d,16)) #this converts hex ascii values to respective characters and concatenate to 's'
print s

As stated you are doing post processing, it would be easier to handle if the text was "313233"
you would then be able to use
data = "313233"
print data.decode("hex") # this will print '123'
As stated in comments this is for python 2.7, and is deprecated in 3.3. However unless this question is mis-tagged, this will work.

Yes, when you do a conversion from a string to an int, you can specify the base of the numbers in the string:
>>> print int("31", 16)
49
>>> chr(49)
'1'
So you should be able to just parse the hex values out of your file and individually convert them to chars.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Issues seen during utf-7 conversion - python-2.7

Related

Convert single Unicode character to ASCII character

Python 2.7, convert utf8 string to ascii

ASCII Control characters: \x0e - \x1f

Python: ascii codec can't encode en-dash

Python hex variable assignment

Categories

Resources