Python 2.7, convert utf8 string to ascii

Python 2.7, convert utf8 string to ascii - python-2.7

I am working with python 2.7.12
I have string which contains a unicode literal, which is not of type Unicode. I would like to convert this to text. This example explains what I am trying to do.
>>> s
'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
>>> print s
username
>>> type(s)
<type 'str'>
>>> s == "username"
False
How would I go about converting this string?

That's not UTF-8, it's UTF-16, though it's unclear whether it's big endian or little endian (you have no BOM, and you have a leading and trailing NUL byte, making it an uneven length). For text in the ASCII range, UTF-8 is indistinguishable from ASCII, while UTF-16 alternates NUL bytes with the ASCII encoded bytes (as in your example).
In any event, converting to plain ASCII is fairly easy, you just need to deal with the uneven length one way or another:
s = 'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00' # I removed \x00 from beginning manually
sascii = s.decode('utf-16-le').encode('ascii')
# Or without manually removing leading \x00
sascii = s.decode('utf-16-be', errors='ignore').encode('ascii')
Course, if your inputs are just NUL interspersed ASCII and you can't figure out the endianness or how to get an even number of bytes, you can just cheat:
sascii = s.replace('\x00', '')
But that won't raise exceptions in the case where the input is some completely different encoding, so it may hide errors that specifying what you expect would have caught.

Related

How to interpret Unicode notation in Python?

How to convert formal Unicode notation like 'U+1F600' into something like this: '\U0001F600', which I saw represented as 'Python Src' at websites online?
My end-goal is to use Unicode for emojis in Python(2.x) and I am able to achieve it in this way:
unicode_string = '\U0001F600'
unicode_string.decode('unicode-escape')
I would appreciate if you could mention the different character sets involved in the above problem.

The simplest way to do it is to just treat the notation as a string:
>>> s = 'U+1F600'
>>> s[2:] # chop off the U+
'1F600'
>>> s[2:].rjust(8, '0') # pad it to 8 characters with 0s
'0001F600'
>>> r'\U' + s[2:].rjust(8, '0') # prepend the `\U`
'\\U0001F600'
It might be a bit cleaner to parse the string as hex and then format the resulting number back out:
>>> int(s[2:], 16)
128512
>>> n = int(s[2:], 16)
>>> rf'\U{n:08X}'
'\\U0001F600'
… but I'm not sure it's really any easier to understand that way.
If you need to extract these from a larger string, you probably want a regular expression.
We want to match a literal U+ followed by 1 to 8 hex digits, right? So, that's U\+[0-9a-fA-F]{1,8}. Except we really don't need to include the U+ just to pull it off with [2:], so let's group the rest of it: U\+([0-9a-fA-F]{1,8}).
>>> s = 'Hello U+1F600 world'
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s)
<_sre.SRE_Match object; span=(6, 13), match='U+1F600'>
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s).group(1)
'1F600'
Now, we can use re.sub with a function to apply the \U prepending and rjust padding:
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', lambda match: r'\U' + match.group(1).rjust(8, '0'), s)
'Hello \\U0001F600 world'
That's probably more readable if you define the function out-of-line:
>>> def padunimatch(match):
... return r'\U' + match.group(1).rjust(8, '0')
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
Or, if you prefer to do it numerically:
>>> def padunimatch(match):
... n = int(match.group(1), 16)
... return rf'\U{n:08X}'
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
And of course you already know how to do the last part, because it's in your question, right? Well, not quite: you can't call decode on a string, only on a bytes. The simplest way around this is to use the codec directly:
>>> x = 'Hello \\U0001F600 world'
>>> codecs.decode(x, 'unicode_escape')
'Hello 😀 world'
… unless you're using Python 2. In that case, the str type isn't a Unicode string, it's a byte-string, so decode actually works fine. But in Python 2, you'll run into other problems, unless all of your text is pure ASCII (with any non-ASCII characters encoded as U+xxxx sequences).
For example, let's say your input was:
>>> s = 'Hej U+1F600 världen'
In Python 3, that's fine. That s is a Unicode string. Under the covers, my console is sending Python UTF-8-encoded bytes to standard input and expecting to get UTF-8-encoded bytes back from standard output, but that just works like magic. (Well, not quite magic—you can print(sys.stdin.encoding, sys.stdout.encoding) to see that Python knows my console is UTF-8 and uses that to decode and encode on my behalf.)
In Python 2, it's not. If my console is UTF-8, what I've actually done is equivalent to:
>>> s = 'Hej U+1F600 v\xc3\xa4rlden'
… and if I try to decode that as unicode-escape, Python 2 will treat those \xc3 and \xa4 bytes as Latin-1 bytes, rather than UTF-8:
>>> s = 'Hej \U0001F600 v\xc3\xa4rlden'
… so what you end up with is:
>>> s.decode('unicode_escape')
u'Hej \U0001f600 v\xc3\xa4rlden'
>>> print(s.decode('unicode_escape'))
Hej 😀 vÃ¤rlden
But what if you try to decode it as UTF-8 first, and then decode that as unicode_escape?
>>> s.decode('utf-8')
u'Hej \\U0001F600 v\xe4rlden'
>>> print(s.decode('utf-8'))
Hej \U0001F600 världen
>>> s.decode('utf-8').decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
Unlike Python 3, which just won't let you call decode on a Unicode string, Python 2 lets you do it—but it handles it by trying to encode to ASCII first, so it has something to decode, and that obviously fails here.
And you can't just use the codec directly, the way you can in Python 3:
>>> codecs.decode(s.decode('utf-8'), 'unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
You could decode the UTF-8, then unicode-escape the result, then un-unicode-escape everything, but even that isn't quite right:
>>> print(s.decode('utf-8').encode('unicode_escape').decode('unicode_escape'))
Hej \U0001F600 världen
Why? Because unicode-escape, while fixing our existing Unicode character, also escaped our backslash!
If you know you definitely have no \U escapes in the original source that you didn't want parsed, there's a quick fix for this: just replace the escaped backslash:
>>> print(s.decode('utf-8').encode('unicode_escape').replace(r'\\U', r'\U').decode('unicode_escape'))
Hej 😀 världen
If this all seems like a huge pain… well, yeah, that's why Python 3 exists, because dealing with Unicode properly in Python 2 (and notice that I didn't even really deal with it properly…) is a huge pain.

Convert single Unicode character to ASCII character

I got a unicode e.g. "00C4" saved in an array. I want to replace a placeholder e.g. "\A25" in a text with the ascii value of an unicode from the array which only has the unicode value. I tried everything from encoding, decoding, raw strings, unicode strings and different setups with the escape symbol "\". The issue here is that i can not write the clear '\u1234' in the code, I have to use the array values and combine it with something like '\u'. This is my current code:
e.g. prototypeArray[i][1] = 00C4
e.g. prototypeArray[i][0] = A25
unicodeChar = u'\\u' + prototypeArray[i][1]
placeholder = '\\' + prototypeArray[i][0]
placeholder = u'' + placeholder
text = text.replace(placeholder,s)
Currently it is only replacing e.g. \A25 with \u00C4 in the text. The unicode character is not interpreted as such.

UTF-8 specific interpretation:
I assume you have the unicode point represented in hexadecimal in UTF-8 stored as a string in a variable (c). And you want to determine the corresponding character. Then the following code snippet shows how to do it:
>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ą'
Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify basically takes two characters each, interprets them as hexadecimal numbers and make them one byte. All those bytes are merged to a bytes array. Finally str.decode('utf-8') interprets those bytes as UTF-8 encoded data and returns it as string.
>>> cp2chr('00C4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data
Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has bit structure 11000100, is therefore a continuation byte and requires another character afterwards.
Encoding independent interpretation:
So you might be looking for interpretation of unicode points independent of the encoding. Then you are looking for the raw_unicode_escape encoding:
>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape')
>>> cp2chr('00C4')
'Ä'
Explanation: raw_unicode_escape convert the unicode escape sequences given in a byte string and returns it as string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is what python does internally if you write \uSOMETHING in your source code.

Issues seen during utf-7 conversion

Python : 2.7
I need to convert to utf-7 before I go ahead, so I have used below code in python 2.7 interpreter:
>>> mbox = u'한국의'
>>> mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
'&1VytbcdY-'
Same code when I use in my python script as shown below, the output for mbox is b'&Ti1W,XaE' instead of b'&Ti1W,XaE-' i.e. "-" at end of string is missing when running as a script instead of interpreter.
mbox = "b'" + mbox + "'"
print mbox
mbox = mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
print mbox
Please suggest.

Quoting from Wikipedia's description of UTF-7:
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates), big-endian (hence higher-order bits appear first), and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
Any block of encoded characters must end with a non-Base64 character. If the string includes such a character, it will be used, otherwise - is added to the end of the block. Your first example includes a - for this reason. Your second example doesn't need one because ' is not part of the Base64 character set.
If your intent is to create a Python literal that creates a valid UTF-7 string, just do things in a different order.
mbox = b"b'" + mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",") + b"'"

ASCII Control characters: \x0e - \x1f

I want to convert the \x0e and \x0f characters to equivalent keyboard text.
Does python able to encode/decode the ASCII control characters(\x0e - \x1f) to keyboard text.

This is not encoded (well not in ascii anyway). This is how the text is supposed to look.
ascii is encoded something like \55 (which is a "-") and nothing like you have given above.
Proof of this can be found if we run your commands and then \55 through this simple program I built:
text = "\x0e \x0f \55" # What we want to try goes here
new_text = text.encode('ascii') # What we want to encode it in
print new_text # Print the outcome
The outcome is:
\x0e \x0f -
This shows that \55 has been converted and so has \x0e \x0f however they remain the same because they are not encoded in ascii.

Python hex variable assignment

I'm using a variable to store data that gets sent by a socket. When I assign it in my program it works but when I read it from a file it is treated as a string.
Example:
data = '\x31\x32\x33'
print data
Outputs
123 # <--- this is the result I want when I read from a file to assign data
f = open('datafile') <--- datafile contains \x31\x32\x33 on one line
data = f.readline()
print data
Outputs
\x31\x32\x33 # <--- wanted it to print 123, not \x31\x32\x33.

In Python the string '\x31\x32\x33' is actually only three characters '\x31' is the character with ordinal 0x31 (49), so '\x31' is equivalent to '1'. It sounds like your file actually contains the 12 characters \x31\x32\x33, which is equivalent to the Python string '\\x31\\x32\\x33', where the escaped backslashes represent a single backslash character (this can also be represented with the raw string literal r'\x31\x32\x33').
If you really are sure that this data should be '123', then you need to look at how that file is being written. If that is something you can control then you should address it there so that you don't end up with data consisting of several bytes representing hex escapes.
It is also possible that whatever is writing this data is already using some data-interchange format (similar to JSON), in which case you don't need to change how it is written, you just need to use a decoder for that data-interchange format (like json.loads(), but this isn't JSON).
If somehow neither of the above are really what you want, and you just want to figure out how to convert a string like r'\x31\x32\x33' to '123' in Python, here is how you can do that:
>>> r'\x31\x32\x33'.decode('string_escape')
'123'
Or in Python 3.x:
>>> br'\x31\x32\x33'.decode('unicode_escape')
'123'
edit: Based on comments it looks like you are actually getting hex strings like '313233', to convert a string like that to '123' you can decode using hex:
>>> '313233'.decode('hex')
'123'
Or on Python 3.x:
>>> bytes.fromhex('313233').decode('utf-8')
'123'

I might have violated many programming standards here, but the following code works for the given situation
with open('datafile') as f:
data=f.read()
data=data.lstrip('\\x') #strips the leftmost '\x' so that now string 'data' contains numbers saperated by '\x'
data=data.strip().split('\\x') #now data contains list of numbers
s=''
for d in data:
s+=chr(int(d,16)) #this converts hex ascii values to respective characters and concatenate to 's'
print s

As stated you are doing post processing, it would be easier to handle if the text was "313233"
you would then be able to use
data = "313233"
print data.decode("hex") # this will print '123'
As stated in comments this is for python 2.7, and is deprecated in 3.3. However unless this question is mis-tagged, this will work.

Yes, when you do a conversion from a string to an int, you can specify the base of the numbers in the string:
>>> print int("31", 16)
49
>>> chr(49)
'1'
So you should be able to just parse the hex values out of your file and individually convert them to chars.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Python 2.7, convert utf8 string to ascii - python-2.7

Related

How to interpret Unicode notation in Python?

Convert single Unicode character to ASCII character

Issues seen during utf-7 conversion

ASCII Control characters: \x0e - \x1f

Python hex variable assignment

Categories

Resources