BeautifulSoup parsing unicode giving variable results - python-2.7

I am trying to parse the following IPython notebook; however, I am getting varying results when I read the unicode into a BeautifulSoup object, i.e.
from IPython.nbconvert.exporters import HTMLExporter
from IPython.config import Config
from bs4 import BeautifulSoup
filepath = '2015-05-01_test2.ipynb'
config = Config({'CSSHTMLHeaderTransformer': {'enabled': True,
'highlight_class': '.highlight-ipynb'}})
exporter = HTMLExporter(config=config, template_file='basic')
content, info = exporter.from_filename(filepath)
soup = BeautifulSoup(content)
soup2 = BeautifulSoup(content)
soup3 = BeautifulSoup(content)
soup4 = BeautifulSoup(content)
print(soup == soup2)
print(soup == soup3)
print(soup == soup4)
Gives output
False
True
True
However, if I run the above snippet multiple times, the output keeps toggling between different sets of True/False combinations. I am very much at a loss as to what could be causing this. Has anyone run into this type of issue before?
The relevant notebook can be found here
Edit
Upon further investigation, this issue seems to happen only on Linux. I have tried this on Linux Mint and Ubuntu as well as on two Windows machines, and only the Linux systems show it. Note that I am using IPython 3.0.0 and bs4 4.3.2.

Another case of RTM: specifying the parser to use seems to solve the problem.
soup = BeautifulSoup(content, 'html.parser')
or
soup = BeautifulSoup(content, 'xml')
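The Beautiful Soup documentation notes that when no parser is named, the library picks the best one it finds installed, so the resulting tree can differ between environments. A minimal sketch of pinning the parser, assuming only that bs4 is installed; the `parse` helper and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup

def parse(html):
    # Name the parser explicitly so every call builds the tree the same way.
    return BeautifulSoup(html, "html.parser")

html = u"<div class='cell'><p>caf\u00e9</p></div>"  # hypothetical sample markup
first = parse(html)
second = parse(html)
# Comparing the serialized trees checks that both parses agree.
print(str(first) == str(second))
```

Note that the 'xml' parser depends on lxml being installed, whereas 'html.parser' ships with Python.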

Related

Pyqt4 to Pyqt5 and python code error

I switched from PyQt4 to PyQt5 and from Python 2.7 to 3.4. I found a bug in PyQt5, so I upgraded it to the latest version, which is not supported by Python 3.4. I then used Python 3.5, and it works fine except for one module called pymaxwell.
The app always crashes and closes. To confirm, I went back to Python 2.7 with PyQt5 for Python 2.7 and got the same error: the app closes immediately and shows an error in a part of the code that works well with PyQt4.
The GIF shows a comparison between PyQt4 and PyQt5 with Python 2.7:
comparison
The part of the code that has the problem:
self.btnMXSmxm()
self.mxs_filt.setChecked(True)
self.show_folder.setChecked(False)
inPath = self.input_mxs_scene_2.text();
self.input_mxs_scene.setText(inPath);
self.mxm_list.clear()
if len(inPath) == 0:
    self.chosen_material.clear()
    # loop over all mxs
else:
    mxsList = self.input_mxs_scene.text()
    print mxsList
    if not len(mxsList) == 0:
        scene = Cmaxwell(mwcallback);
        ok = scene.readMXS(mxsList);
        sceneMaterials = scene.getMaterialNames();
        materialcount = int(scene.getMaterialsCount()[0])
        if os.path.isfile(self.input_mxs_scene.text()):
            for name in sceneMaterials:
                scenematerial = scene.getMaterial(name)

Problem solved.
I use Python 3.5.4 and PyQt 5.9.
Now neither read(file) nor read(str(file)) gives an error.
I also found that with PyQt5 some lines of the code need to be rearranged.
I modified the code and it works:
mxsList = self.input_mxs_scene.text()
scene = Cmaxwell(mwcallback)
materialcount = int(scene.getMaterialsCount()[0])
if not len(inPath) == 0:
    if os.path.isfile(self.input_mxs_scene.text()):
        ok = scene.readMXS(mxsList)
        sceneMaterials = scene.getMaterialNames()
        if not len(mxsList) == 0:
            for name in sceneMaterials:
                scenematerial = scene.getMaterial(name)
                # #print scenematerial
                self.mxm_list.addItem(name)
I also added: from PyQt5.QtCore import *
That solved another problem with reading media files.
After many tries with different versions of Python and PyQt5, the versions that work best together are Python 2.7.9 and PyQt5 from GitHub:
PyQt 5.7.1 for Python 2.7

python + wx & uno to fill libreoffice using ubuntu 14.04

I collected user data using a wxPython GUI and then used uno to fill this data into an OpenOffice document under Ubuntu 10.xx:
user + my-script ( +empty document ) --> prefilled document
After upgrading to Ubuntu 14.04, uno doesn't work with Python 2.7 anymore, and Ubuntu now ships LibreOffice instead of OpenOffice. When I try to run my Python 2.7 code, it says:
ImportError: No module named uno
How can I get it working again?
What I tried:
installed https://pypi.python.org/pypi/unotools v0.3.3
sudo apt-get install libreoffice-script-provider-python
converted the code to Python 3 and got uno importable, but wx is not importable in Python 3:
ImportError: No module named 'wx'
googled and read that Python 3 only works with wxPython Phoenix
so I tried to install http://wxpython.org/Phoenix/snapshot-builds/
but wasn't able to get it to run with Python 3
Is there a way to get the uno bridge to work with Python 2.7 under Ubuntu 14.04?
Or how can I get wx to run with Python 3?
What else could I try?

Create a Python macro in LibreOffice that does the work of inserting the data into the document, and then invoke the macro from your Python 2.7 code.
As the macro runs from within LibreOffice, it will use Python 3.
Here is an example of how to invoke a LibreOffice macro from the command line:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
##
# a python script to run a libreoffice python macro externally
# NOTE: for this to run start libreoffice in the following manner
# soffice "--accept=socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;" --writer --norestore
# OR
# nohup soffice "--accept=socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;" --writer --norestore &
#
import uno
from com.sun.star.connection import NoConnectException
from com.sun.star.uno import RuntimeException
from com.sun.star.uno import Exception
from com.sun.star.lang import IllegalArgumentException
def uno_directmacro(*args):
    localContext = uno.getComponentContext()
    localsmgr = localContext.ServiceManager
    resolver = localsmgr.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext )
    try:
        ctx = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
    except NoConnectException as e:
        print ("LibreOffice is not running or not listening on the port given - ("+e.Message+")")
        return
    msp = ctx.getValueByName("/singletons/com.sun.star.script.provider.theMasterScriptProviderFactory")
    sp = msp.createScriptProvider("")
    scriptx = sp.getScript('vnd.sun.star.script:directmacro.py$directmacro?language=Python&location=user')
    try:
        scriptx.invoke((), (), ())
    except IllegalArgumentException as e:
        print ("The command given is invalid ( "+ e.Message+ ")")
        return
    except RuntimeException as e:
        print("An unknown error occurred: " + e.Message)
        return
    except Exception as e:
        print ("Script error ( "+ e.Message+ ")")
        print(e)
        return
    return(None)
uno_directmacro()
And this is the corresponding macro code within LibreOffice, called "directmacro.py" and stored in the user area for LibreOffice macros (which would normally be $HOME/.config/libreoffice/4/user/Scripts/python):
#!/usr/bin/python
from com.sun.star.awt.MessageBoxButtons import BUTTONS_OK, BUTTONS_OK_CANCEL, BUTTONS_YES_NO, BUTTONS_YES_NO_CANCEL, BUTTONS_RETRY_CANCEL, BUTTONS_ABORT_IGNORE_RETRY
from com.sun.star.awt.MessageBoxButtons import DEFAULT_BUTTON_OK, DEFAULT_BUTTON_CANCEL, DEFAULT_BUTTON_RETRY, DEFAULT_BUTTON_YES, DEFAULT_BUTTON_NO, DEFAULT_BUTTON_IGNORE
from com.sun.star.awt.MessageBoxType import MESSAGEBOX, INFOBOX, WARNINGBOX, ERRORBOX, QUERYBOX
def directmacro(*args):
    import socket, time
    class FontSlant():
        from com.sun.star.awt.FontSlant import (NONE, ITALIC,)
    # get the doc from the scripting context which is made available to all scripts
    desktop = XSCRIPTCONTEXT.getDesktop()
    model = desktop.getCurrentComponent()
    text = model.Text
    tRange = text.End
    cursor = desktop.getCurrentComponent().getCurrentController().getViewCursor()
    doc = XSCRIPTCONTEXT.getDocument()
    parentwindow = doc.CurrentController.Frame.ContainerWindow
    # you cannot insert simple text and text into a table with the same method
    # so we have to know if we are in a table or not.
    # oTable and oCurCell will be null if we are not in a table
    oTable = cursor.TextTable
    oCurCell = cursor.Cell
    insert_text = "This is text inserted into a LibreOffice Document\ndirectly from a macro called externally"
    Text_Italic = FontSlant.ITALIC
    Text_None = FontSlant.NONE
    cursor.CharPosture=Text_Italic
    if oCurCell == None: # Are we inserting into a table or not?
        text.insertString(cursor, insert_text, 0)
    else:
        cell = oTable.getCellByName(oCurCell.CellName)
        cell.insertString(cursor, insert_text, False)
    cursor.CharPosture=Text_None
    return None
You will of course need to adapt the code to either accept data as arguments, read it from a file or whatever.
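One way to hand the collected data over is a file whose path is passed on the command line. The sketch below is an assumption, not part of the answer above: `run_with_python3` and the script name `fill_document.py` are hypothetical, and the receiving script would have to read `sys.argv[1]` itself.

```python
import json
import subprocess
import tempfile

def run_with_python3(script_path, data, interpreter="python3"):
    """Dump `data` to a temporary JSON file and run a Python 3 script on it.

    Returns the exit code of the child process.
    """
    # delete=False keeps the file on disk so the child process can read it.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
        json.dump(data, handle)
    return subprocess.call([interpreter, script_path, handle.name])

# Hypothetical call from the wx application:
# run_with_python3("fill_document.py", {"name": "Jane", "city": "Berlin"})
```

Passing a file path rather than the data itself keeps the command line short and avoids quoting problems with user input.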

Ideally I would say use python 3, because python 2 is becoming outdated. The switch requires quite a bit of new coding changes, but better sooner than later. So I tried:
sudo pip3 install -U --pre \
-f http://wxpython.org/Phoenix/snapshot-builds/ \
wxPython_Phoenix
However this gave me errors, and I didn't want to spend the next couple of days working through them. Probably the pre-release versions are not ready for prime time yet.
So instead, what I recommend is to switch to AOO for now. See https://stackoverflow.com/a/27980255/5100564 for instructions. AOO does not have all the latest features that LO has, but it is a good solid Office product.
Apparently it is also possible to rebuild LibreOffice with python 2 using this script: https://gist.github.com/hbrunn/6f4a007a6ff7f75c0f8b

'foo.bar.com.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'

I'm using Django and store things like images on S3 (using boto for this), but recently I got this error:
'foo.bar.com.s3.amazonaws.com' doesn't match either of
'*.s3.amazonaws.com', 's3.amazonaws.com'
I have been searching for a solution for about two days, but the only thing suggested is to change boto's source code, which I can't do in production.
Edit: Using Django 1.58, Boto 2.38.0
Any help would be appreciated.
Thx in advance.

As has been said, the problem occurs with buckets that have dots in their names. Below is just an example that avoids this.
import boto
from boto.s3.connection import VHostCallingFormat
c = boto.connect_s3(aws_access_key_id='your-access-key',
is_secure=False,
aws_secret_access_key='your-secret-access',
calling_format=VHostCallingFormat())
b = c.get_bucket(bucket_name='your.bucket.with.dots', validate=True)
print(b)

You can use this monkey patch in your connection.py file (boto/connection.py):
import ssl
_old_match_hostname = ssl.match_hostname
def _new_match_hostname(cert, hostname):
    if hostname.endswith('.s3.amazonaws.com'):
        pos = hostname.find('.s3.amazonaws.com')
        hostname = hostname[:pos].replace('.', '') + hostname[pos:]
    return _old_match_hostname(cert, hostname)
ssl.match_hostname = _new_match_hostname
(source)
another solution is in here:

It's a known issue: #2836. It's due to the dots in your bucket name.
I had this issue a few days ago. One user seems to have fixed it by setting:
AWS_S3_HOST = 's3-eu-central-1.amazonaws.com'
AWS_S3_CALLING_FORMAT = 'boto.s3.connection.OrdinaryCallingFormat'
But it didn't work for me.
Otherwise, you can create a bucket without dots (e.g. foo-bar-com). That will work. This is what I did to temporarily fix the issue.
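Why the dots matter: in certificate hostname matching, a `*` in the leftmost position stands for exactly one DNS label, so `*.s3.amazonaws.com` cannot cover `foo.bar.com.s3.amazonaws.com`. The function below is a simplified illustration of that rule written for this note; it is not boto's or the ssl module's implementation, and `wildcard_matches` is a made-up name.

```python
def wildcard_matches(pattern, hostname):
    """Simplified check: '*' in the pattern stands for exactly one DNS label."""
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    # A wildcard never spans several labels, so the label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False
    for expected, actual in zip(pattern_labels, host_labels):
        if expected == "*":
            if not actual:
                return False
        elif expected != actual:
            return False
    return True

# Bucket name without dots: one label in front of s3.amazonaws.com.
print(wildcard_matches("*.s3.amazonaws.com", "foo-bar-com.s3.amazonaws.com"))
# Bucket name with dots: three labels in front, so the wildcard cannot match.
print(wildcard_matches("*.s3.amazonaws.com", "foo.bar.com.s3.amazonaws.com"))
```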

I had the same error when I moved from python3.4 to python3.5 (boto version remained the same).
The problem was solved by moving from boto to boto3 package.
Example:
import boto3
s3_client = boto3.client("s3", aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)
body = open(local_filename, 'rb').read()
resp = s3_client.put_object(ACL="private", Body=body, Bucket=your_bucket_name, Key=path_to_the_file_inside_the_bucket)
if resp["ResponseMetadata"]["HTTPStatusCode"] != 200:
    raise Exception("Something went wrong...")

Downloading a file in Python 2.x vs 3.x

I'm working on a script that checks multiple domain names if they are registered or not. It loops through a list of the domains read from a file, and the registration check is done using enom's API. My problem is the code accessing the API in Python 2:
import urllib2
import xml.etree.ElementTree as ET
...
response = urllib2.urlopen(url)
content = ET.fromstring(response.read())
...
return content.find("RRPCode").text
generates the error: 'NoneType' object has no attribute 'text' in nearly 30% of the checks. While the Python 3 code:
import urllib.request
import xml.etree.ElementTree as ET
...
response = urllib.request.urlopen(url)
content = ET.fromstring(response.read())
...
return content.find("RRPCode").text
works fine. I should also mention that the number of errors is random and not related to specific domain names.
What could be the cause of these errors?
I am by the way using Python 2.7.3 and Python 3.2.3 on a VPS running Ubuntu 12.04 server.
Thanks.
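The error message means that find() returned None, i.e. the parsed response had no RRPCode element directly under its root. Why that happens more often under Python 2 is not established here. A small sketch that makes the failing responses visible instead of crashing; the `extract_rrp_code` helper and both sample payloads are hypothetical, not real enom output:

```python
import xml.etree.ElementTree as ET

def extract_rrp_code(xml_text):
    """Return the text of the RRPCode child, or None if it is missing."""
    root = ET.fromstring(xml_text)
    node = root.find("RRPCode")  # find() only looks at direct children of root
    if node is None:
        return None
    return node.text

# Hypothetical payloads, made up for illustration only.
good = "<interface-response><RRPCode>210</RRPCode></interface-response>"
bad = "<interface-response><ErrCount>1</ErrCount></interface-response>"

for payload in (good, bad):
    code = extract_rrp_code(payload)
    if code is None:
        # Printing the raw response shows what the server actually sent.
        print("No RRPCode in response: " + payload)
    else:
        print("RRPCode: " + code)
```

Logging the raw body for the failing checks would show whether the API returned an error document, a rate-limit notice, or something else.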

Program for Backup - Python

I'm trying to execute the following code in Python 2.7 on Windows 7. The purpose of the code is to back up a specified folder to another specified folder, following the given naming pattern.
However, I'm not able to get it to work. The output has always been 'Backup FAILED'.
Please advise on how I can resolve this and get the code working.
Thanks.
Code :
backup_ver1.py
import os
import time
import sys
sys.path.append('C:\Python27\GnuWin32\bin')
source = 'C:\New'
target_dir = 'E:\Backup'
target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target,''.join(source))
print('This is a program for backing up files')
print(zip_command)
if os.system(zip_command)==0:
    print('Successful backup to', target)
else:
    print('Backup FAILED')

See if escaping the backslashes helps:
source = 'C:\\New'
target_dir = 'E:\\Backup'
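Two details in the question's script are worth checking, though the thread does not confirm either as the cause: in a normal string literal `\b` is a backspace character, so 'C:\Python27\GnuWin32\bin' is not the path it appears to be, and sys.path only controls where Python looks for modules, not where os.system finds zip. A sketch that sidesteps both by using raw strings and the standard library's shutil.make_archive; the `backup` helper is a made-up name:

```python
import os
import shutil
import time

def backup(source, target_dir):
    """Zip the contents of `source` into a timestamped archive in `target_dir`.

    Returns the path of the archive that was written.
    """
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    base_name = os.path.join(target_dir, time.strftime('%Y%m%d%H%M%S'))
    # make_archive appends '.zip' to base_name and needs no external zip tool.
    return shutil.make_archive(base_name, 'zip', root_dir=source)

# Call with raw strings so backslashes are taken literally (paths from the question):
# print('Successful backup to', backup(r'C:\New', r'E:\Backup'))
```

Because the archive is built inside Python, a failure surfaces as an exception with a message rather than as a non-zero exit code from os.system.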
