Regular Expression '[\w-]+(\.[\w-]+)*' doesn't get matched - regex

I want to process some sentences from the PostgreSQL documentation and do some analysis. In the word-splitting stage, I tried to use the regex '[\w-]+(\.[\w-]+)*' proposed by Lotufo et al. in the article Modelling the Hurried bug report reading process to summarize bug reports.
Strangely, I can't get the expected result using this regex in Python.
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 6.4.0 -- An enhanced Interactive Python.
>>> import re
>>> result = re.findall(r'[\w-]+(\.[\w-]+)*', 'Specifies the directory to use for data storage.')
>>> print(result)
I expected to get a list of words:
['Specifies', 'the', 'directory', 'to', 'use', 'for', 'data', 'storage']
But I only got a list of empty strings:
['', '', '', '', '', '', '', '']
Does anyone have any idea what is wrong with my code? Thanks a lot.

This works the way you were expecting:
Python 3.7.2 (default, Jan 16 2019, 19:49:22)
[GCC 8.2.1 20181215 (Red Hat 8.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> split = re.compile(r'(\w+)')
>>> split.findall('Specifies the directory to use for data storage.')
['Specifies', 'the', 'directory', 'to', 'use', 'for', 'data', 'storage']
>>>
Those square brackets on your regular expression don't feel right. I guess they are the cause.

The expected strings are matched, but they aren't in a capturing group. Use this regex instead:
r'([\w-]+(?:\.[\w-]+)*)'
Note that I added ?: to the inner parentheses to make them non-capturing.
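To make the difference concrete: when a pattern contains a capturing group, re.findall returns what the group captured rather than the whole match. Here the optional group never takes part in the match for any of these words, so every result is an empty string. A minimal sketch comparing the two patterns (the variable names are only for illustration):

```python
import re

sentence = 'Specifies the directory to use for data storage.'

# Capturing group: findall returns the group's content for each match.
# The group (\.[\w-]+)* repeats zero times for every word, so it captures ''.
captured = re.findall(r'[\w-]+(\.[\w-]+)*', sentence)

# Non-capturing group (?:...): findall returns the whole match instead.
whole = re.findall(r'[\w-]+(?:\.[\w-]+)*', sentence)

print(captured)
print(whole)
```

Wrapping the entire pattern in an outer capturing group, as in the regex above, gives the same list as the second call.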

I want to use the openslide functionality in pyvips

I tried to use both openslide and pyvips, and my application can't find the necessary .dll. I think the problem comes from using both libraries together.
I have read that pyvips has openslide embedded, but I can't find out how to use it. My main goal is to read Whole Slide Images, see the different levels and augmentations, and work with them.
I'd really appreciate your help! Thank you
Yes, pyvips usually includes openslide, so you can't use both together.
Use .get_fields() to see all the metadata on an image, for example:
$ python3
Python 3.9.7 (default, Sep 10 2021, 14:59:43)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyvips
>>> x = pyvips.Image.new_from_file("openslide/CMU-1.svs")
>>> x.width
46000
>>> x.height
32914
>>> x.get_fields()
['width', 'height', 'bands', 'format', 'coding', 'interpretation', 'xoffset', 'yoffset',
'xres', 'yres', 'filename', 'vips-loader', 'slide-level', 'aperio.AppMag', 'aperio.Date',
'aperio.Filename', 'aperio.Filtered', 'aperio.Focus Offset', 'aperio.ICC Profile',
'aperio.ImageID', 'aperio.Left', 'aperio.LineAreaXOffset', 'aperio.LineAreaYOffset',
...
By default pyvips opens the base level of the image (the largest). Use level= to pick another level, for example:
>>> x = pyvips.Image.new_from_file("openslide/CMU-1.svs", level=2)
>>> x.width
2875
See the docs for details:
https://www.libvips.org/API/current/VipsForeignSave.html#vips-openslideload

How to extract coordinates (lat, long) from a URL in Python?

I'm a bit lost on how to extract coordinates (Lat, Long) from a URL in Python.
I will always receive a URL like this:
https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896
I need to extract the second set of coordinates from this URL (in this case 41.403781,2.1896). Note that the first and second sets of coordinates will not always be the same.
I know this can be done with a regex, but I'm not good enough at them.
Here's how to do it with a regular expression:
import re
m = re.search(r'pll=(\d+\.\d+),(\d+\.\d+)', 'https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896')
print(m.groups())
Result: ('41.403781', '2.1896')
You might want to look at the urlparse module (urllib.parse in Python 3) for a more robust solution.
It provides the functions urlparse and parse_qs for accessing this data reliably, as shown in this Python 2 session:
$ python
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u="""https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896"""
>>> import urlparse
>>> x=urlparse.urlparse(u)
>>> x
ParseResult(scheme='https', netloc='www.testweb.com', path='/cordi', params='', query='ll=41.403781,2.1896&z=17&pll=41.403781,2.1896', fragment='')
>>> x.query
'll=41.403781,2.1896&z=17&pll=41.403781,2.1896'
>>> urlparse.parse_qs(x.query)
{'ll': ['41.403781,2.1896'], 'z': ['17'], 'pll': ['41.403781,2.1896']}
>>>
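On Python 3 the same functions live in urllib.parse. Here is a minimal sketch of pulling out the pll pair as floats; the helper name extract_pll is made up for illustration, and it assumes the parameter is always present and formatted as "lat,long":

```python
from urllib.parse import urlparse, parse_qs


def extract_pll(url):
    """Return the 'pll' query parameter of url as a (lat, long) tuple of floats.

    Hypothetical helper: assumes 'pll' is present and holds "lat,long".
    """
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each parameter name to a list of values; take the first.
    lat, lon = query['pll'][0].split(',')
    return float(lat), float(lon)


url = 'https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896'
print(extract_pll(url))
```

Unlike the regex, this does not depend on where pll appears in the query string, and float() also accepts negative coordinates.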

how to mock subprocess.call in a unittest

I'm on Python 3.3 and I have to test a method which uses call from subprocess.py.
I tried:
subprocess.call = MagicMock()
with patch('subprocess.call') as TU_call:
but in debug mode I found that Python actually calls the real subprocess.call.
Works fine for me (Ubuntu 13.04, Python 3.3.1):
$ python3.3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mock
>>> import subprocess
>>> result = subprocess.call('date')
Fri Jan 3 19:45:32 CET 2014
>>> subprocess.call = mock.create_autospec(subprocess.call, return_value='mocked!')
>>> result = subprocess.call('date')
>>> print(result)
mocked!
>>> subprocess.call.mock_calls
[call('date')]
I believe this question is about the usage of this particular mock package.
General statements, unrelated to your direct question
I wrote this up before I understood that the question is specifically about the use of the Python mock package.
One general way to mock functions is to explicitly redefine the function or method:
$ python3.3
Python 3.3.1 (default, Sep 25 2013, 19:29:01)
[GCC 4.7.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.call('date')
Fri Jan 3 19:23:25 CET 2014
0
>>> def mocked_call(*a, **kw):
...     return 'mocked'
...
>>> subprocess.call = mocked_call
>>> subprocess.call('date')
'mocked'
The big advantage of this straightforward approach is that this is free of any package dependencies. The disadvantage is that if there are specific needs, all the decision making logic has to be coded manually.
As an example of a mocking package, FlexMock is available for both Python 2.7 and Python 3.*, and its use for overriding subprocess.call is discussed in this question.
This works for subprocess.check_output in Python 3:
#mock.patch('subprocess.check_output', mock.mock_open())
#mock.patch('subprocess.Popen.communicate')
def tst_prepare_data_for_matrices(self, makedirs_mock, check_output_mock):
    config_file = open(os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir)+'/etc/test/config.json')).read()
    check_output_mock.return_value = ("output", "Error")
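For the original subprocess.call question, here is a minimal sketch using unittest.mock, which has been in the standard library since Python 3.3. The function run_date is a made-up stand-in for the method under test. Note that this works because the code looks up subprocess.call through the module at call time; if the code under test did "from subprocess import call", you would need to patch the name in that module instead.

```python
import subprocess
from unittest import mock


def run_date():
    """Hypothetical code under test: shells out through subprocess.call."""
    return subprocess.call(['date'])


# patch() swaps subprocess.call for a MagicMock inside the with block
# and restores the real function when the block exits.
with mock.patch('subprocess.call', return_value=0) as fake_call:
    exit_code = run_date()

# The real 'date' command was never executed; the mock recorded the call.
fake_call.assert_called_once_with(['date'])
```

Compared with assigning to subprocess.call directly, the context manager guarantees the original function is restored afterwards, so other tests are not affected.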

How do I use python interface of Stanford NER(named entity recogniser)?

I want to use Stanford NER in Python using the pyner library. Here is a basic code snippet.
import ner
tagger = ner.HttpNER(host='localhost', port=80)
tagger.get_entities("University of California is located in California, United States")
When I run this in my local Python console (IDLE), it should give me output like this:
{'LOCATION': ['California', 'United States'],
'ORGANIZATION': ['University of California']}
but when I execute it, I get empty brackets. I am new to all this.
I am able to run the stanford-ner server in socket mode using:
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz \
-port 8080 -outputFormat inlineXML
and receive the following output from the command line:
Loading classifier from
/Users/roneill/stanford-ner-2012-11-11/classifiers/english.muc.7class.distsim.crf.ser.gz
... done [1.7 sec].
Then in the Python REPL:
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ner
>>> tagger = ner.SocketNER(host='localhost', port=8080)
>>> tagger.get_entities("University of California is located in California, United States")
{'ORGANIZATION': ['University of California'], 'LOCATION': ['California', 'United States']}

regexp for phone (555) 555-9087

I'm using the pattern attribute for my <input type="tel" /> and I'm having a hard time with the regexp. I tried pattern="d{10]" and pattern="d{3}[\)]\d{3}[\-]\d{4}", but neither works.
Use this instead:
pattern="\(\d{3}\) \d{3}-\d{4}"
Here's the fiddle: http://jsfiddle.net/ZMaXA/
Try messing with the value in the input.
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s = '(777) 777-9087'
>>> r = re.search(r'\(([^\)]+)\)\s*(\d+-\d+)', s)
>>> r.groups()
('777', '777-9087')
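One thing to keep in mind: the HTML pattern attribute has to match the entire input value, whereas re.search accepts a match anywhere in the string. If you want to repeat the same check in Python (for example on the server side), re.fullmatch is the closer equivalent. A minimal sketch, with the names PHONE_RE and is_valid_phone chosen just for illustration:

```python
import re

# Same expression as the pattern attribute above: "(ddd) ddd-dddd".
PHONE_RE = re.compile(r'\(\d{3}\) \d{3}-\d{4}')


def is_valid_phone(value):
    """Return True only if the whole string is a phone number like (555) 555-9087."""
    # fullmatch requires the pattern to cover the entire string.
    return PHONE_RE.fullmatch(value) is not None


print(is_valid_phone('(555) 555-9087'))
print(is_valid_phone('555-555-9087'))
```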