There is a Python function, that runs in CPython 2.5.3 but crashes in Jython 2.5.3 .
It is part of a user defined function in Apache Pig, which uses Jython 2.5.3 so i cannot change it.
The input is a array of singed bytes, but in fact that are unsigned bytes, so i need to cast it.
from StringIO import StringIO
import array
import ctypes
assert isinstance(input, array.array), 'unexpected input parameter'
assert input.typecode == 'b', 'unexpected input type'
buffer = StringIO()
for byte in input:
s_byte = ctypes.c_byte(byte)
s_byte_p = ctypes.pointer(s_byte)
u_byte = ctypes.cast(s_byte_p, ctypes.POINTER(ctypes.c_ubyte)).contents.value
buffer.write(chr(u_byte))
buffer.seek(0)
output = buffer.getvalue()
assert isinstance(output, str)
The error is:
s_byte = ctypes.cast(u_byte_p, ctypes.POINTER(ctypes.c_byte)).contents.value
AttributeError: 'module' object has no attribute 'cast'
I guess the ctypes.cast functions is not implemeted in Jython 2.5.3 . Is there a workaround for that issue?
Thanks,
Steffen
Here is my solution, that is quite ugly but works without additional dependecies.
It uses the bit representation of usinged und signed bytes (https://de.wikipedia.org/wiki/Zweierkomplement).
import array
assert isinstance(input, array.array), 'unexpected input parameter'
assert input.typecode == 'b', 'unexpected input type'
output = array.array('b', [])
for byte in input:
if byte > 127:
byte = byte & 127
byte = -128 + byte
output.append(byte)
Related
This question already has answers here:
Python to show special characters
(3 answers)
Closed 4 years ago.
Hi there I am trying to make python recognize ® as a symbol( if it doesn't show up that well here but it is the symbol with a capital R within a circle known as the 'registered' symbol)
I understand that it is not recognized in python due to ASCII however i was wondering if anyone knows of a way to use a different decoding system that includes this symbol or a method to make python 'ignore' it.
For some context:
I am trying to make an auto checkout program for a website so my program needs to match the item that the user wants. To do this I am using Beatifulsoup to scrape information however this symbol '®' is within the names of a few of the items causing python to crash.
Here is the current command that I am using but is not working due to ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
CnI.append(str(colour.text))
Uhrefs.append(str(colour.get('href')))
Any help would be appreciated
Here is the entirety of the program so far(ignore the mess nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
CnI.append(str(colour.text))
Uhrefs.append(str(colour.get('href')))
print(colour)
print('#############')
for each in CnI:
each.split(',')
print(each)
while Splitcounter<=len(CnI):
item.append(CnI[Splitcounter-1])
FinalColours.append(CnI[Splitcounter])
Whrefs.append(Uhrefs[Splitcounter])
Splitcounter+=2
print(Uhrefs)
for each in item:
print(each)
for z in FinalColours:
print(z)
for i in Whrefs:
print(i)
##for i in item:
## hold = item.index(i)
## print(hold)
## if Witem == i and Wcolour == FinalColours[i]:
## print('correct')
##
##
for count,elem in enumerate(item):
if Witem in elem:
selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
if Wcolour in elem:
selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
if selectColourindex[Ccounter] in selectItemindex:
point = selectColourindex[Ccounter]
print(point)
else:
Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()
"®" is not a character but a unicode codepoint so if you're using Python2, your code will never work. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet that was posted later and the fact that BeautifulSoup actually returns unicode already, it seems that removal of str() might be the best course of action and #MarkTolonen's answer is spot-on.
BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
Decode incoming text to Unicode (what BeautifulSoup is doing).
Process all text using Unicode.
Encode outgoing text to Unicode (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
File "test.py", line 2, in <module>
bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note the first line works and supports the character. Converting it to a byte string fails because it defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8'), but that breaks rule 2 above and creates other issues.
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
I'm developing a chatbot with the chatterbot library. The chatbot is in my native language --> Slovene, which has a lot of strange characters (for example: š, č, ž). I'm using python 2.7.
When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:
chatBot.set_trainer(ListTrainer)
chatBot.train([
"Koliko imam še dopusta?",
"Letos imate še 19 dni dopusta.",
])
it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte
I added the # -*- coding: utf-8 -*- line to the top of my file, I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8, I changed the system default encoding with the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The strings are of type unicode.
When I try to get a response, with these strange characters, it works, it has no issues with them. For example, running the following code in the same execution as the above training code(when I change 'š' to 's' and 'č' to 'c', in the train strings), throws no errors:
chatBot.set_trainer(ListTrainer)
chatBot.train([
"Koliko imam se dopusta?",
"Letos imate se 19 dni dopusta.",
])
chatBot.get_response("Koliko imam še dopusta?")
I can't find a solution to this issue. Any suggestions?
Thanks loads in advance. :)
EDIT: I used from __future__ import unicode_literals, to make strings of type unicode. I also checked if they really were unicode with the method type(myString)
I would also like to paste this link.
EDIT 2: #MallikarjunaraoKosuri - s code works, but in my case, I had one more thing inside the chatbot instance intialization, which is the following:
chatBot = ChatBot(
'Test',
trainer='chatterbot.trainers.ListTrainer',
storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)
This is the cause of my error. The json storage file the chatbot creates, is created in my local encoding and not in utf-8. It seems the default storage (.sqlite3), doesn't have this issue, so for now I'll just avoid the json storage. But I am still interested in finding a solution to this error.
The strings from your example are not of type unicode.
Otherwise Python would not throw the UnicodeDecodeError.
This type of error says that at a certain step of program's execution Python tries to decode byte-string into unicode but for some reason fails.
In your case the reason is that:
decoding is configured by utf-8
your source file is not in utf-8 and almost certainly in cp1252:
import unicodedata
b = '\x9a'
# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a
# in position 0: invalid start byte
u = b.decode('cp1252')
print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š
So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
The best solution is to do nothing except convertation your source to utf-8.
With Sublime Text 3 you can easily do it by: File -> Reopen with Encoding -> UTF-8.
But don't forget to Ctrl+C your source code before the convertation beacuse just after that all your š, č, ž chars wil be replaced with ?.
Some of our friends are already suggested good part solutions, However again I would like combine all the solutions into one.
And author #gunthercox suggested some guidelines are described here http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors
# -*- coding: utf-8 -*-
from chatterbot import ChatBot
# Create a new chat bot named Test
chatBot = ChatBot(
'Test',
trainer='chatterbot.trainers.ListTrainer'
)
chatBot.train([
"Koliko imam še dopusta?",
"Letos imate še 19 dni dopusta.",
])
Python Terminal
>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>>
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
... 'Test',
... trainer='chatterbot.trainers.ListTrainer'
... )
>>>
>>> chatBot.train([
... "Koliko imam še dopusta?",
... "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>
I have some py2 code that works in python 2:
import uuid
import hashlib
playername = "OfflinePlayer:%s" % name # name is just a text only username
m = hashlib.md5()
m.update(playername)
d = bytearray(m.digest())
d[6] &= 0x0f
d[6] |= 0x30
d[8] &= 0x3f
d[8] |= 0x80
print(uuid.UUID(bytes=str(d)))
However, when the code when run in python3, it produces "TypeError: Unicode-objects must be encoded before hashing" when m.update() is attempted. I tried to endcode it first with the default utf-8 by:
m.update(playername.encode())
but now this line -
print(uuid.UUID(bytes=str(d)))
produces this error:
File "/usr/lib/python3.5/uuid.py", line 149, in __init__
raise ValueError('bytes is not a 16-char string')
ValueError: bytes is not a 16-char string
I then tried to decode it back, but the bitwise operations have evidently ruined it (I am guessing?):
print(uuid.UUID(bytes=(d.decode())))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 2: invalid start byte
I don't honestly know what the purpose of the bitwise operations is in the "big-picture". The code snippet in general is supposed to produce the same expected UUID every time based on the spelling of the username.
I just want this code to do the same job in Python3 that it did in Python 2 :(
Thanks in advance.
A couple of things:
m.update(playername.encode('utf-8'))
Should properly encode your string.
print(uuid.UUID(bytes=bytes(d)))
Should return the UUID properly.
Example:
https://repl.it/EyuG/0
I have a code like that:
#!/usr/bin/env python3
#-*-coding:utf-8-*-
from PyQt4 import QtGui, QtCore
import re
.....
str = self.lineEdit.text() # lineEdit is a object in QtGui.QLineEdit class
# This line thanks to Fedor Gogolev et al from
#https://stackoverflow.com/questions/12214801/print-a-string-as-hex-bytes
print('\\u'+"\\u".join("{:x}".format(ord(c)) for c in str))
# u+20000-u+2a6d6 is CJK Ext B
cjk = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$",re.UNICODE)
if cjk.match(str):
print("OK")
else:
print("error")
when I inputted "敏感詞" (0x654F,0x611F, 0x8A5E in utf16 respectively), the result was:
\u654f\u611f\u8a5e
OK
but when I input "詞𠀷𠂁𠁍" (0x8A5E, 0xD840 0xDC37, 0xD840 0xDC81, 0xD840 0xDC4D in utf-16) in which there were 3 characters from CJK Extention B Area. The result which is not expected is:
\u8a5e\ud840\udc37\ud840\udc81\ud840\udc4d
error
how can I processed these CJK characters with converting to utf-8 to be processed suitabliy with re of Python3?
P.S.
the value from sys.maxunicode is 1114111, it might be UCS-4. Hence, I think that the question seems not to be the same as
python regex fails to match a specific Unicode > 2 hex values
another code:
#!/usr/bin/env python3
#-*-coding:utf-8-*-
import re
CJKBlock = re.compile("^[一-鿌㐀-䶵\U00020000-\U0002A6D6]+$") #CJK ext B
print(CJKBlock.search('詞𠀷𠂁𠁍'))
returns <_sre.SRE_Match object; span=(0, 4), match='詞𠀷𠂁𠁍'> #expected result.
even I added self.lineEdit.setText("詞𠀷𠂁𠁍") inside __init__ function of the window class and executed it, the word in LineEdit shows appropriately, but when I pressed enter, the result was still "error"
version:
Python3.4.3
Qt version: 4.8.6
PyQt version: 4.10.4.
There were a few PyQt4 bugs following the implemetation of PEP-393 that can affect conversions between QString and python strings. If you use sip to switch to the v1 API, you should probably be able to confirm that the QString returned by the line-edit does not contain surrogate pairs. But if you then convert it to a python string, the surrogates should appear.
Here is how to test this in an interactive session:
>>> import sip
>>> sip.setapi('QString', 1)
>>> from PyQt4 import QtGui
>>> app = QtGui.QApplication([])
>>> w = QtGui.QLineEdit()
>>> w.setText('詞𠀷𠂁𠁍')
>>> qstr = w.text()
>>> qstr
PyQt4.QtCore.QString('詞𠀷𠂁𠁍')
>>> pystr = str(qstr)
>>> print('\\u' + '\\u'.join('{:x}'.format(ord(c)) for c in pystr))
\u8a5e\u20037\u20081\u2004d
Of course, this last line does not show surrogates for me, because I cannot do the test with PyQt-4.10.4. I have tested with PyQt-4.11.1 and PyQt-4.11.4, though, and I did not get see any problems. So you should try to upgrade to one of those.
I am experiencing a strange problem that returns the same error, regardless of the encoding I use. The code works well, without the encoding part in Python 2.7.8, but it breaks in 2.7.6 which is the version that I use for all my development.
import MIDI_PY2 as md
import glob
import ast
import os
dir = '/Users/user/Desktop/sample midis/'
os.chdir(dir)
file_list = []
for file in glob.glob('*.mid'):
file_list.append((dir + file))
dir = '/Users/user/Desktop/sample midis/'
os.chdir(dir)
file_list returns this:
[u'/Users/user/Desktop/sample midis/M1.mid',
u'/Users/user/Desktop/sample midis/M2.mid',
u'/Users/user/Desktop/sample midis/M3.mid',
u'/Users/user/Desktop/sample midis/M4.mid']
md.concatenate_midis(file_list,'/Users/luissanchez/Desktop/temp/out.mid') returns this error:
-
TypeError Traceback (most recent call last)
<ipython-input-73-2d7eef92f566> in <module>()
----> 1 md.concatenate_midis(file_list_1,'/Users/user/Desktop/temp/out.mid')
/Users/user/Desktop/sample midis/MIDI_PY2.pyc in concatenate_midis(paths, outPath)
/Users/user/Desktop/sample midis/MIDI_PY2.pyc in midi2score(midi)
/Users/user/Desktop/sample midis/MIDI_PY2.pyc in midi2opus(midi)
TypeError: Struct() argument 1 must be string, not unicode
then I modify the code so the first argument is string, not unicode:
file_list_1 = [str(x) for x in file_list]
which returns:
['/Users/user/Desktop/sample midis/M1.mid',
'/Users/user/Desktop/sample midis/M2.mid',
'/Users/user/Desktop/sample midis/M3.mid',
'/Users/user/Desktop/sample midis/M4.mid']
running the function concatenate_midis with this last list (file_list_1) returns exactly the same error: TypeError: Struct() argument 1 must be string, not unicode.
Does anybody knows what's going on here? concatenate_midi works well in python 2.7.8, but can't figure out why it doesn't work in what I use, Enthought Canopy Python 2.7.6 | 64-bit
Thanks
The error
error: TypeError: Struct() argument 1 must be string, not unicode.
is usually caused by the struct.unpack() function which in older versions of python requires string arguments and not unicode. Check that struct.unpack() arguments are strings and not unicodes.
One possible cause is from __future__ .. statement.
>>> type('a')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('a')
<type 'unicode'>
Check whether your code contains the statement.