I'm using a variable to store data that gets sent by a socket. When I assign it in my program it works but when I read it from a file it is treated as a string.
Example:
data = '\x31\x32\x33'
print data
Outputs
123 # <--- this is the result I want when I read from a file to assign data
f = open('datafile') <--- datafile contains \x31\x32\x33 on one line
data = f.readline()
print data
Outputs
\x31\x32\x33 # <--- wanted it to print 123, not \x31\x32\x33.
In Python the string '\x31\x32\x33' is actually only three characters '\x31' is the character with ordinal 0x31 (49), so '\x31' is equivalent to '1'. It sounds like your file actually contains the 12 characters \x31\x32\x33, which is equivalent to the Python string '\\x31\\x32\\x33', where the escaped backslashes represent a single backslash character (this can also be represented with the raw string literal r'\x31\x32\x33').
If you really are sure that this data should be '123', then you need to look at how that file is being written. If that is something you can control then you should address it there so that you don't end up with data consisting of several bytes representing hex escapes.
It is also possible that whatever is writing this data is already using some data-interchange format (similar to JSON), in which case you don't need to change how it is written, you just need to use a decoder for that data-interchange format (like json.loads(), but this isn't JSON).
If somehow neither of the above are really what you want, and you just want to figure out how to convert a string like r'\x31\x32\x33' to '123' in Python, here is how you can do that:
>>> r'\x31\x32\x33'.decode('string_escape')
'123'
Or in Python 3.x:
>>> br'\x31\x32\x33'.decode('unicode_escape')
'123'
edit: Based on comments it looks like you are actually getting hex strings like '313233', to convert a string like that to '123' you can decode using hex:
>>> '313233'.decode('hex')
'123'
Or on Python 3.x:
>>> bytes.fromhex('313233').decode('utf-8')
'123'
I might have violated many programming standards here, but the following code works for the given situation
with open('datafile') as f:
data=f.read()
data=data.lstrip('\\x') #strips the leftmost '\x' so that now string 'data' contains numbers saperated by '\x'
data=data.strip().split('\\x') #now data contains list of numbers
s=''
for d in data:
s+=chr(int(d,16)) #this converts hex ascii values to respective characters and concatenate to 's'
print s
As stated you are doing post processing, it would be easier to handle if the text was "313233"
you would then be able to use
data = "313233"
print data.decode("hex") # this will print '123'
As stated in comments this is for python 2.7, and is deprecated in 3.3. However unless this question is mis-tagged, this will work.
Yes, when you do a conversion from a string to an int, you can specify the base of the numbers in the string:
>>> print int("31", 16)
49
>>> chr(49)
'1'
So you should be able to just parse the hex values out of your file and individually convert them to chars.
Related
I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes as input a unicode a string, i.e. some👩😌thing, and then removes all emojis, i.e. "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in that process, I came across a strange behavior which I demonstrate below, as you can see. I believe if the code below is fixed, then I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7, Unicode codepoints above 0x10000 are encoded as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and the recent versions of Python 3 will do it automatically.
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 uses a forced word-based Unicode storage, in which certain Unicode codepoints are automatically substituted by surrogate pairs.
Before the regex "sees" your Python string, Python already helpfully parsed your large Unicode codepoints into two separate characters (each on its own a valid – but incomplete – single Unicode character).
That means that [\U0001f469]+' replaces something (a character class of 2 characters), but one of them is in your string and the other is not. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|U0001F60C)+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'(\U0001f469)+', u'', text) # some�thing
# .. and now it does:
some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
if bad in text:
print 'Removing '+bad
text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get nothing to repeat error due to the literal chars messing up with the alternation operators inside the group, so map(re.escape,exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.
How to convert formal Unicode notation like 'U+1F600' into something like this: '\U0001F600', which I saw represented as 'Python Src' at websites online?
My end-goal is to use Unicode for emojis in Python(2.x) and I am able to achieve it in this way:
unicode_string = '\U0001F600'
unicode_string.decode('unicode-escape')
I would appreciate if you could mention the different character sets involved in the above problem.
The simplest way to do it is to just treat the notation as a string:
>>> s = 'U+1F600'
>>> s[2:] # chop off the U+
'1F600'
>>> s[2:].rjust(8, '0') # pad it to 8 characters with 0s
'0001F600'
>>> r'\U' + s[2:].rjust(8, '0') # prepend the `\U`
'\\U0001F600'
It might be a bit cleaner to parse the string as hex and then format the resulting number back out:
>>> int(s[2:], 16)
128512
>>> n = int(s[2:], 16)
>>> rf'\U{n:08X}'
'\\U0001F600'
… but I'm not sure it's really any easier to understand that way.
If you need to extract these from a larger string, you probably want a regular expression.
We want to match a literal U+ followed by 1 to 8 hex digits, right? So, that's U\+[0-9a-fA-F]{1,8}. Except we really don't need to include the U+ just to pull it off with [2:], so let's group the rest of it: U\+([0-9a-fA-F]{1,8}).
>>> s = 'Hello U+1F600 world'
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s)
<_sre.SRE_Match object; span=(6, 13), match='U+1F600'>
>>> re.search(r'U\+([0-9a-fA-F]{1,8})', s).group(1)
'1F600'
Now, we can use re.sub with a function to apply the \U prepending and rjust padding:
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', lambda match: r'\U' + match.group(1).rjust(8, '0'), s)
'Hello \\U0001F600 world'
That's probably more readable if you define the function out-of-line:
>>> def padunimatch(match):
... return r'\U' + match.group(1).rjust(8, '0')
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
Or, if you prefer to do it numerically:
>>> def padunimatch(match):
... n = int(match.group(1), 16)
... return rf'\U{n:08X}'
>>> re.sub(r'U\+([0-9a-fA-F]{1,8})', padunimatch, s)
'Hello \\U0001F600 world'
And of course you already know how to do the last part, because it's in your question, right? Well, not quite: you can't call decode on a string, only on a bytes. The simplest way around this is to use the codec directly:
>>> x = 'Hello \\U0001F600 world'
>>> codecs.decode(x, 'unicode_escape')
'Hello 😀 world'
… unless you're using Python 2. In that case, the str type isn't a Unicode string, it's a byte-string, so decode actually works fine. But in Python 2, you'll run into other problems, unless all of your text is pure ASCII (with any non-ASCII characters encoded as U+xxxx sequences).
For example, let's say your input was:
>>> s = 'Hej U+1F600 världen'
In Python 3, that's fine. That s is a Unicode string. Under the covers, my console is sending Python UTF-8-encoded bytes to standard input and expecting to get UTF-8-encoded bytes back from standard output, but that just works like magic. (Well, not quite magic—you can print(sys.stdin.encoding, sys.stdout.encoding) to see that Python knows my console is UTF-8 and uses that to decode and encode on my behalf.)
In Python 2, it's not. If my console is UTF-8, what I've actually done is equivalent to:
>>> s = 'Hej U+1F600 v\xc3\xa4rlden'
… and if I try to decode that as unicode-escape, Python 2 will treat those \xc3 and \xa4 bytes as Latin-1 bytes, rather than UTF-8:
>>> s = 'Hej \U0001F600 v\xc3\xa4rlden'
… so what you end up with is:
>>> s.decode('unicode_escape')
u'Hej \U0001f600 v\xc3\xa4rlden'
>>> print(s.decode('unicode_escape'))
Hej 😀 världen
But what if you try to decode it as UTF-8 first, and then decode that as unicode_escape?
>>> s.decode('utf-8')
u'Hej \\U0001F600 v\xe4rlden'
>>> print(s.decode('utf-8'))
Hej \U0001F600 världen
>>> s.decode('utf-8').decode('unicode-escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
Unlike Python 3, which just won't let you call decode on a Unicode string, Python 2 lets you do it—but it handles it by trying to encode to ASCII first, so it has something to decode, and that obviously fails here.
And you can't just use the codec directly, the way you can in Python 3:
>>> codecs.decode(s.decode('utf-8'), 'unicode_escape')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 16: ordinal not in range(128)
You could decode the UTF-8, then unicode-escape the result, then un-unicode-escape everything, but even that isn't quite right:
>>> print(s.decode('utf-8').encode('unicode_escape').decode('unicode_escape'))
Hej \U0001F600 världen
Why? Because unicode-escape, while fixing our existing Unicode character, also escaped our backslash!
If you know you definitely have no \U escapes in the original source that you didn't want parsed, there's a quick fix for this: just replace the escaped backslash:
>>> print(s.decode('utf-8').encode('unicode_escape').replace(r'\\U', r'\U').decode('unicode_escape'))
Hej 😀 världen
If this all seems like a huge pain… well, yeah, that's why Python 3 exists, because dealing with Unicode properly in Python 2 (and notice that I didn't even really deal with it properly…) is a huge pain.
I got a unicode e.g. "00C4" saved in an array. I want to replace a placeholder e.g. "\A25" in a text with the ascii value of an unicode from the array which only has the unicode value. I tried everything from encoding, decoding, raw strings, unicode strings and different setups with the escape symbol "\". The issue here is that i can not write the clear '\u1234' in the code, I have to use the array values and combine it with something like '\u'. This is my current code:
e.g. prototypeArray[i][1] = 00C4
e.g. prototypeArray[i][0] = A25
unicodeChar = u'\\u' + prototypeArray[i][1]
placeholder = '\\' + prototypeArray[i][0]
placeholder = u'' + placeholder
text = text.replace(placeholder,s)
Currently it is only replacing e.g. \A25 with \u00C4 in the text. The unicode character is not interpreted as such.
UTF-8 specific interpretation:
I assume you have the unicode point represented in hexadecimal in UTF-8 stored as a string in a variable (c). And you want to determine the corresponding character. Then the following code snippet shows how to do it:
>>> import binascii
>>> cp2chr = lambda c: binascii.unhexlify(c.zfill(len(c) + (len(c) & 1))).decode('utf-8')
>>> cp2chr('C484')
'Ä„'
Explanation: zfill prepends a zero if the number of characters is odd. binascii.unhexlify basically takes two characters each, interprets them as hexadecimal numbers and make them one byte. All those bytes are merged to a bytes array. Finally str.decode('utf-8') interprets those bytes as UTF-8 encoded data and returns it as string.
>>> cp2chr('00C4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 1: unexpected end of data
Your provided example, however, is not valid UTF-8 data. See Wikipedia's UTF-8 byte structure table to identify valid byte sequences. C4 has bit structure 11000100, is therefore a continuation byte and requires another character afterwards.
Encoding independent interpretation:
So you might be looking for interpretation of unicode points independent of the encoding. Then you are looking for the raw_unicode_escape encoding:
>>> cp2chr = lambda c: (b'\\u' + c.encode('ascii')).decode('raw_unicode_escape')
>>> cp2chr('00C4')
'Ä'
Explanation: raw_unicode_escape convert the unicode escape sequences given in a byte string and returns it as string: b'\\u00C4'.decode('raw_unicode_escape') gives Ä. This is what python does internally if you write \uSOMETHING in your source code.
data = "000000000000000117c80378b8da0e33559b5997f2ad55e2f7d18ec1975b9717"
result1 = data.decode('hex')[::-1]
The hex data are decoded to decimal, which is 6,860,217,587,554,922,525,607,992,740,653,361,396,256,930,700,588,249,487,127
Then the decimal number 6,860,217,587,554,922,525,607,992,740,653,361,396,256,930,700,588,249,487,127 is converted to bits and reversed its order (little-endian) and stored in result1 variable as a bitarray?
Is this what exactly happens with that code or did I misunderstood anything?
So the result1 variable is a bitarray?
If it's just a integer variable, how can it hold that much long decimal value?
Strings in python are declared using double or single quotes, therefore the variable data contains a string.
You can check the type of a variable directly in python:
data = "000000000000000117c80378b8da0e33559b5997f2ad55e2f7d18ec1975b9717"
type(data)
which outputs
str
meaning that the variable is a string.
When you call the function decode('hex') on a string you obtain another string:
data.decode('hex')
'\x00\x00\x00\x00\x00\x00\x00\x01\x17\xc8\x03x\xb8\xda\x0e3U\x9bY\x97\xf2\xadU\xe2\xf7\xd1\x8e\xc1\x97[\x97\x17'
Every character in your original string is interpreted as an hexadecimal number, and every pairs of hexadecimal numbers - e.s. "17" - is converted into an hexadecimal character using the escape sequence \x - becoming "\x17".
When you write "\x41" you are basically telling python to interpret 41 as a single ASCII character whose hexadecimal representation is 41.
The ASCII table contains the hexadecimal, decimal and octal values associated to the ascii characters.
If you try for example
"48454C4C4F".decode('hex')
you obtain the string "HELLO"
Lastly when you use [::-1] on a string you reverse it:
"48454C4C4F".decode('hex')[::-1]
produces the string "OLLEH"
You can find more about the escape characters reading the python documentation.
I want to convert the \x0e and \x0f characters to equivalent keyboard text.
Does python able to encode/decode the ASCII control characters(\x0e - \x1f) to keyboard text.
This is not encoded (well not in ascii anyway). This is how the text is supposed to look.
ascii is encoded something like \55 (which is a "-") and nothing like you have given above.
Proof of this can be found if we run your commands and then \55 through this simple program I built:
text = "\x0e \x0f \55" # What we want to try goes here
new_text = text.encode('ascii') # What we want to encode it in
print new_text # Print the outcome
The outcome is:
\x0e \x0f -
This shows that \55 has been converted and so has \x0e \x0f however they remain the same because they are not encoded in ascii.