Regex dealing with Kanji characters in Python

For a web-scraping project I'm working on, I've been trying to separate failed results from successful ones.
Basically, if the page title contains 指定されたページが見つかりません ("The specified page could not be found"), I want to write the URL to a fail.csv file. For anything else, I want to write the URL to sucess.csv.
html = 'www.abc.com'
url = BeautifulSoup(html,'html.parser').title.string
pattern = re.compile(r' 指定されたページが見つかりません')
if pattern.finditer(url):
    with open('fail.csv','w') as f:
        cw=csv.writer
        cw.writerow([url])
else:
    pass  # move on, run some other code and write to sucess.csv
However, it seems that the regex isn't recognising 指定されたページが見つかりません.
Am I doing something wrong or missing something here?
Thanks

Try installing the third-party packages (re is part of the standard library and needs no install):
sudo pip3 install requests
sudo pip3 install beautifulsoup4
and then, under python3:
import requests
import re
from bs4 import BeautifulSoup
r = requests.get('https://corp.rakuten.co.jp/careers/life/')
r.encoding='utf-8'
pattern = re.compile(r' 指定されたページが見つかりません')
url = BeautifulSoup(r.text,'html.parser').title.string
pattern.findall(url)
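Independently of encoding, two details in the question's code can hide the real result: pattern.finditer(...) returns an iterator, which is always truthy, so the if branch runs every time, and csv.writer is never called with the file object. The pattern also starts with a space, which must be present in the title for a match. A minimal sketch of the check and the write, assuming the title string has already been extracted (the names is_not_found and record are hypothetical, and the leading space has been dropped from the pattern):

```python
import csv
import io
import re

# Assumption: the leading space from the question's pattern is not wanted.
NOT_FOUND = re.compile(r'指定されたページが見つかりません')


def is_not_found(title):
    """Return True when the page title contains the 'page not found' message."""
    # search() returns None when nothing matches; finditer() returns an
    # iterator, which is truthy even when it would yield no matches.
    return title is not None and NOT_FOUND.search(title) is not None


def record(url, title, fail_file, success_file):
    """Write the URL as one CSV row to the fail file or the success file."""
    target = fail_file if is_not_found(title) else success_file
    csv.writer(target).writerow([url])


if __name__ == '__main__':
    fail, success = io.StringIO(), io.StringIO()
    record('http://example.com/a', '指定されたページが見つかりません', fail, success)
    record('http://example.com/b', 'Careers', fail, success)
    print(fail.getvalue(), success.getvalue())
```

With real files, open them once with newline='' and in append mode if several URLs are processed, since opening with 'w' inside a loop overwrites the previous row.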

Related

Python script on Django shell not seeing import if import not set as global?

I searched Stack Overflow and wasn't able to find this. I've noticed something I can't wrap my head around: when run as a normal Python script the import works fine, but when run from the Django shell it behaves strangely, and the import has to be declared global to be seen.
You can reproduce it like this: make a file test.py in the folder containing manage.py, with the code below.
This doesn't work (code of test.py):
#!/usr/bin/env python3
import chardet
class LoadList():
    def __init__(self):
        self.email_list_path = '/home/omer/test.csv'
    @staticmethod
    def check_file_encoding(file_to_check):
        encoding = chardet.detect(open(file_to_check, "rb").read())
        return encoding
    def get_encoding(self):
        return self.check_file_encoding(self.email_list_path)['encoding']
print(LoadList().get_encoding())
This works when chardet is declared global inside test.py:
#!/usr/bin/env python3
import chardet
class LoadList():
    def __init__(self):
        self.email_list_path = '/home/omer/test.csv'
    @staticmethod
    def check_file_encoding(file_to_check):
        global chardet
        encoding = chardet.detect(open(file_to_check, "rb").read())
        return encoding
    def get_encoding(self):
        return self.check_file_encoding(self.email_list_path)['encoding']
print(LoadList().get_encoding())
The first run is without global chardet, and it produces the error. The second run is with global chardet set, and it works.
What is going on? Can someone explain why the import isn't seen until it is declared global?
Piping a file into the Django shell is the same as piping it into the python command. It's not the same as running the file with python test.py. I suspect it has something to do with how the newlines are interpreted when the file is parsed, but I don't have time to check.
Instead of this approach, I'd recommend writing a custom management command.
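One mechanism that produces this kind of symptom in plain Python is worth knowing about. It is offered here as an assumption about what a shell might do with piped input, not something the answer confirms. When code is run through exec() with separate globals and locals dictionaries, a top-level import is stored in the locals, while functions defined in that code look names up in the globals, so the imported module is not visible inside them. A small sketch using json as a stand-in for chardet (the helper names are hypothetical):

```python
# Stand-in for test.py: a top-level import used inside a function.
source = """
import json

def dump():
    return json.dumps([1, 2])

result = dump()
"""


def run_separate_namespaces(code):
    """Run code with distinct globals and locals, as some embedders do."""
    try:
        # `import json` lands in the locals dict, but dump() resolves
        # `json` through the (different) globals dict and fails.
        exec(code, {}, {})
    except NameError as exc:
        return "NameError: " + str(exc)
    return "ok"


def run_single_namespace(code):
    """Run code with one namespace, which is how `python test.py` behaves."""
    namespace = {}
    exec(code, namespace)
    return namespace["result"]


if __name__ == "__main__":
    print(run_separate_namespaces(source))
    print(run_single_namespace(source))
```

A custom management command avoids the question entirely, because the module is imported normally and has a single module-level namespace.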

memcache on django is not working

I have a race condition in Celery. Inspired by this - http://ask.github.io/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time I decided to use memcache to add locks to my tasks.
These are the changes I made:
python-memcached
# settings for memcache
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}
After this I log in to my shell and do the following:
>>> import os
>>> import django
>>> from django.core.cache import cache
>>> os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings.base')
>>> cache
<django.core.cache.DefaultCacheProxy object at 0x101e4c860>
>>> cache.set('my_key', 'hello, world!', 30) #display nothing. No T/F
>>> cache.get('my_key') #Display nothing.
>>> from django.core.cache import caches
>>> caches['default']
<django.core.cache.backends.memcached.MemcachedCache object at 0x1048a5208>
>>> caches['default'].set('my_key', 'hello, world!', 30) #display nothing. No T/F
>>> caches['default'].get('my_key') #Display nothing.
I also ran pip install python-memcached.
Using Python 3.6, Django==1.10.5
What am I doing wrong? Any help will be appreciated.
The problem was that memcached had been killed for some reason and I had assumed it was still running. My bad.
Now it all works perfectly.
For anyone stuck on a similar problem: make sure memcached is actually running, for example by starting it with memcached -vv.
Keeping this here for reference.
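Because cache.get() returns None on a miss and the interactive prompt prints nothing for None, a dead memcached can look like a silent success. A quick way to rule that out before debugging Django settings is to check whether anything is listening on the configured address. A minimal sketch using only the standard library; the function name port_is_open is hypothetical, and the default host and port are taken from the LOCATION setting in the question:

```python
import socket


def port_is_open(host="127.0.0.1", port=11211, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. connection refused or
        # timeout) when nothing is accepting connections on that address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("something is listening on 127.0.0.1:11211")
    else:
        print("nothing is listening on 127.0.0.1:11211; is memcached running?")
```

This only shows that the port accepts connections, not that the listener is actually memcached, so treat a True result as a first check rather than proof.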
If your install is looking like mine is starting to look (in which case, you have my sympathies; for RHEL is one platform where django seems thinner on-the-ground than this-week's python):
# yum upgrade -y
yum-config-manager --add-repo=https://dl.fedoraproject.org/pub/epel/7/x86_64/
# yum install -y emacs-nox
yum install -y python{-sqlparse,36{,-{devel,pip,pytz,bcrypt}}} memcached
service memcached start
chkconfig memcached on
python36 \
    -m pip install \
    django \
    pymemcache \
    --no-deps --upgrade
Then checking the status of memcached is as easy as using the same commands we've been using for 20 years:
service memcached status
There are other commands one can use on RHEL/EL7 (#fridgeArt), but I prefer compatible workflows in spite of resume-driven differentiation. >:-(
Here's how to launch it with -vv under EL and debuntus too: https://stackoverflow.com/a/22239764/2066657

urllib module errors launching script with python 2.7.13

I want to run a script that fetches a file by URL with urllib, but I always get a sequence of errors.
I checked the documentation, which says that something is deprecated, but I can't find the right syntax.
import urllib
fhand = urllib.urlopen('http://www-dr-chuck.com/page1.htm')
for line in fhand:
    print line.strip()
Your URL is wrong.
http://www-dr-chuck.com/page1.htm
I think it should be
http://www.dr-chuck.com/page1.htm
There should be a dot . after www not a dash -.
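A mistyped hostname surfaces as a long traceback, which is easy to misread as a problem with the urllib syntax. Catching the error and printing its reason makes the cause clearer. The sketch below is a Python 3 equivalent (an assumption, since the question uses Python 2.7, where the call is urllib.urlopen); the function name fetch_lines and the opener parameter are hypothetical and exist only so the function can be tried without network access:

```python
import urllib.error
import urllib.request


def fetch_lines(url, opener=urllib.request.urlopen):
    """Return the stripped text lines at url, or None if it cannot be opened."""
    try:
        with opener(url) as response:
            return [line.decode("utf-8", errors="replace").strip()
                    for line in response]
    except urllib.error.URLError as exc:
        # A wrong hostname typically ends up here, with the resolver's
        # message available in exc.reason.
        print("could not open %s: %s" % (url, exc.reason))
        return None


if __name__ == "__main__":
    lines = fetch_lines("http://www.dr-chuck.com/page1.htm")
    if lines is not None:
        for line in lines:
            print(line)
```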

python + wx & uno to fill libreoffice using ubuntu 14.04

I collected user data using a wx Python GUI and then used uno to fill this data into an OpenOffice document under Ubuntu 10.xx:
user + my-script ( +empty document ) --> prefilled document
After upgrading to Ubuntu 14.04, uno no longer works with Python 2.7, and Ubuntu now ships LibreOffice instead of OpenOffice. When I try to run my Python 2.7 code, it says:
ImportError: No module named uno
How can I get it working again?
What I tried:
installed https://pypi.python.org/pypi/unotools v0.3.3
sudo apt-get install libreoffice-script-provider-python
converted the code to Python 3 and got uno importable, but wx is not importable in Python 3 :-/
ImportError: No module named 'wx'
googled and read that Python 3 only works with wx Phoenix
so tried to install: http://wxpython.org/Phoenix/snapshot-builds/
but wasn't able to get it to run with Python 3
Is there a way to get the uno bridge to work with Python 2.7 under Ubuntu 14.04?
Or how can I get wx to run with Python 3?
What else could I try?
Create a Python macro in LibreOffice that does the work of inserting the data into LibreOffice, and then invoke the macro from your Python 2.7 code.
As the macro runs from within LibreOffice, it will use Python 3.
Here is an example of how to invoke a LibreOffice macro from the command line:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
##
# a python script to run a libreoffice python macro externally
# NOTE: for this to run start libreoffice in the following manner
# soffice "--accept=socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;" --writer --norestore
# OR
# nohup soffice "--accept=socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;" --writer --norestore &
#
import uno
from com.sun.star.connection import NoConnectException
from com.sun.star.uno import RuntimeException
from com.sun.star.uno import Exception
from com.sun.star.lang import IllegalArgumentException
def uno_directmacro(*args):
    localContext = uno.getComponentContext()
    localsmgr = localContext.ServiceManager
    resolver = localsmgr.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext )
    try:
        ctx = resolver.resolve("uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext")
    except NoConnectException as e:
        print ("LibreOffice is not running or not listening on the port given - ("+e.Message+")")
        return
    msp = ctx.getValueByName("/singletons/com.sun.star.script.provider.theMasterScriptProviderFactory")
    sp = msp.createScriptProvider("")
    scriptx = sp.getScript('vnd.sun.star.script:directmacro.py$directmacro?language=Python&location=user')
    try:
        scriptx.invoke((), (), ())
    except IllegalArgumentException as e:
        print ("The command given is invalid ( "+ e.Message+ ")")
        return
    except RuntimeException as e:
        print("An unknown error occurred: " + e.Message)
        return
    except Exception as e:
        print ("Script error ( "+ e.Message+ ")")
        print(e)
        return
    return(None)
uno_directmacro()
And this is the corresponding macro code within LibreOffice, called "directmacro.py" and stored in the user area for LibreOffice macros (which would normally be $HOME/.config/libreoffice/4/user/Scripts/python):
#!/usr/bin/python
from com.sun.star.awt.MessageBoxButtons import BUTTONS_OK, BUTTONS_OK_CANCEL, BUTTONS_YES_NO, BUTTONS_YES_NO_CANCEL, BUTTONS_RETRY_CANCEL, BUTTONS_ABORT_IGNORE_RETRY
from com.sun.star.awt.MessageBoxButtons import DEFAULT_BUTTON_OK, DEFAULT_BUTTON_CANCEL, DEFAULT_BUTTON_RETRY, DEFAULT_BUTTON_YES, DEFAULT_BUTTON_NO, DEFAULT_BUTTON_IGNORE
from com.sun.star.awt.MessageBoxType import MESSAGEBOX, INFOBOX, WARNINGBOX, ERRORBOX, QUERYBOX
def directmacro(*args):
    import socket, time
    class FontSlant():
        from com.sun.star.awt.FontSlant import (NONE, ITALIC,)
    # get the doc from the scripting context which is made available to all scripts
    desktop = XSCRIPTCONTEXT.getDesktop()
    model = desktop.getCurrentComponent()
    text = model.Text
    tRange = text.End
    cursor = desktop.getCurrentComponent().getCurrentController().getViewCursor()
    doc = XSCRIPTCONTEXT.getDocument()
    parentwindow = doc.CurrentController.Frame.ContainerWindow
    # you cannot insert simple text and text into a table with the same method
    # so we have to know if we are in a table or not.
    # oTable and oCurCell will be null if we are not in a table
    oTable = cursor.TextTable
    oCurCell = cursor.Cell
    insert_text = "This is text inserted into a LibreOffice Document\ndirectly from a macro called externally"
    Text_Italic = FontSlant.ITALIC
    Text_None = FontSlant.NONE
    cursor.CharPosture=Text_Italic
    if oCurCell == None: # Are we inserting into a table or not?
        text.insertString(cursor, insert_text, 0)
    else:
        cell = oTable.getCellByName(oCurCell.CellName)
        cell.insertString(cursor, insert_text, False)
    cursor.CharPosture=Text_None
    return None
You will of course need to adapt the code to either accept data as arguments, read it from a file or whatever.
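One way to get the collected data from the Python 2.7 GUI code to the Python 3 side is to write it to a file and start the Python 3 script as a separate process. The sketch below is only one possible arrangement, not part of the answer: the function name run_helper, the JSON file format, and the script name run_macro.py in the usage note are all hypothetical. It uses only standard-library calls that exist in both Python 2.7 and Python 3.

```python
import json
import subprocess
import sys
import tempfile


def run_helper(data, command):
    """Write data to a temporary JSON file and run command with its path appended.

    Returns the helper's exit code and the path of the JSON file.
    """
    # delete=False keeps the file on disk after the with-block so the
    # child process can read it; the caller removes it afterwards.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
        json.dump(data, handle)
        path = handle.name
    return subprocess.call(list(command) + [path]), path


if __name__ == "__main__":
    # Stand-in child process: the current interpreter printing the file it was given.
    child = [sys.executable, "-c", "import sys; print(open(sys.argv[1]).read())"]
    code, json_path = run_helper({"name": "Alice"}, child)
    print(code, json_path)
```

In the GUI code the call might look like run_helper(form_data, ["python3", "run_macro.py"]), with the Python 3 script reading the JSON file named in sys.argv[1] before invoking the macro.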
Ideally I would say use python 3, because python 2 is becoming outdated. The switch requires quite a bit of new coding changes, but better sooner than later. So I tried:
sudo pip3 install -U --pre \
    -f http://wxpython.org/Phoenix/snapshot-builds/ \
    wxPython_Phoenix
However this gave me errors, and I didn't want to spend the next couple of days working through them. Probably the pre-release versions are not ready for prime time yet.
So instead, what I recommend is to switch to AOO for now. See https://stackoverflow.com/a/27980255/5100564 for instructions. AOO does not have all the latest features that LO has, but it is a good solid Office product.
Apparently it is also possible to rebuild LibreOffice with python 2 using this script: https://gist.github.com/hbrunn/6f4a007a6ff7f75c0f8b

pywin32 and pyttsx error, trouble combining the two

I have pywin32 in my site-packages and pyttsx in a separate folder. Is this the reason why I am getting the following error?
import win32api, sys, os
ImportError: DLL load failed: The specified module could not be found
The code is as follows,
import pyttsx
def onStart(name):
    print 'starting', name
def onWord(name, location, length):
    print 'word', name, location, length
def onEnd(name, completed):
    print 'finishing', name, completed
engine = pyttsx.init()
engine.connect('started-utterance', onStart)
engine.connect('started-word', onWord)
engine.connect('finished-utterance', onEnd)
engine.say('The quick brown fox jumped over the lazy dog.')
engine.runAndWait()
from here, http://pyttsx.readthedocs.org/en/latest/engine.html#examples
My pywin32 is from here,
http://sourceforge.net/projects/pywin32/files/pywin32/Build%20219/
for Py 2.7
The problem was that the file pywintypes27.dll was not in the right directory. It had to be in 'C:\Windows\System32'.
@CristiFati
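To see whether the DLL is present in any of the directories the system might search, a short scan can help before moving files around. This is a rough sketch with a hypothetical helper name, find_on_path; it only checks the directories listed in PATH (or a list you pass in) and does not reproduce the full Windows DLL search order.

```python
import os


def find_on_path(filename, search_dirs=None):
    """Return the directories from search_dirs (default: PATH) containing filename."""
    if search_dirs is None:
        search_dirs = os.environ.get("PATH", "").split(os.pathsep)
    # Skip empty entries, which appear when PATH has a trailing separator.
    return [d for d in search_dirs
            if d and os.path.isfile(os.path.join(d, filename))]


if __name__ == "__main__":
    hits = find_on_path("pywintypes27.dll")
    print(hits if hits else "pywintypes27.dll was not found in any PATH directory")
```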
Use the pyttsx3 module instead. It supports both Python 3 and Python 2.
To install:
pip install pyttsx3
It automatically installs the win32 and other dependencies.