Regular expression to group a pattern OR group empty string as "" - regex

I'm using Python 3.3.2 with regular expressions. I have a pretty simple function
def DoRegexThings(somestring):
m = re.match(r'(^\d+)( .*$)?', somestring)
return m.group(1)
Which I am using to just get a numeric portion at the beginning of string, and discard the rest. However, it fails on the case of an empty string, since it is unable to match a group.
I've looked at this similar question which was asked previously, and changed my regular expression to this:
(^$)|(^\d+)( .*$)?
But it only causes it to return "None" every time, and still fails on empty strings. What I really want is a regular expression which I can use to either grab the numeric portion of my record, e.g. if the record is 1234 sometext, I just want 1234, or if the string is empty I want m.group(1) to return an empty string. My workaround right now is
m = re.match(r'(^\d+)( .*$)?', somestring)
if m == None: # Handle empty string case
return somestring
else:
return m.group(1)
But if I can avoid checking the match object for None, I'd like to. Is there a way to accomplish this?

I think you're making this overly complicated:
re.match(r"\d*", somestring).group()
will return a number if it's at the start of the string (.match() ensures this) or the empty string if there is no number.
>>> import re
>>> somestring = "987kjh"
>>> re.match(r"\d*", somestring).group()
'987'
>>> somestring = "kjh"
>>> re.match(r"\d*", somestring).group()
''

Related

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem you have is trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is based on a pattern that matches ONLY ONE key=value, but is applied by a function that finds all accurances of the pattern. Every languages supplies such a function. Here's the code in Python for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alpha-numeric chars - (\w+) with a value. The value is designated by ([^|]+), that is everything but a vertical line, because the value can have non-alpha numeric values, such a dot in the ip address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you work on doesn't require regex. All high level languages supply a split function. You can split by vertical line, and then each slice you get (expcept the first one) you split again by the equal sign.
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures property contains all the values captured into a group at all the iterations.
Not sure, but maybe regular expression might be unnecessary, and splitting similar to,
data='my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x= data.split('|')
appName = []
for index,item in enumerate(x):
if index>0:
element = item.split('=')
temp = {"key":element[0],"value":element[1]}
appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
using dict:
temp = {"key":element[0],"value":element[1]}
temp can be modified to other desired data structure that you like to have.

Scala regex find/replace with additional formatting

I'm trying to replace parts of a string that contains what should be dates, but which are possibly in an impermissible format. Specifically, all of the dates are in the form "mm/dd/YYYY" and they need to be in the form "YYYY-mm-dd". One caveat is that the original dates may not exactly be in the mm/dd/YYYY format; some are like "5/6/2015". For example, if
val x = "where date >= '05/06/2017'"
then
x.replaceAll("'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})'", "'$3-$1-$2'")
performs the desired replacement (returns "2017-05-06"), but for
val y = "where date >= '5/6/2017'"
this does not return the desired replacement (returns "2017-5-6" -- for me, an invalid representation). With the Joda Time wrapper nscala-time, I've tried capturing the dates and then reformatting them:
import com.github.nscala_time.time.Imports._
import org.joda.time.DateTime
val f = DateTimeFormat.forPattern("yyyy-MM-dd")
y.replaceAll("'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'",
"'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime("$1"))+"'")
But this fails with a java.lang.IllegalArgumentException: Invalid format: "$1". I've also tried using the f interpolator and padding with 0s, but it doesn't seem to like that either.
Are you not able to do additional processing on the captured groups ($1, etc.) inside the replaceAll? If not, how else can I achieve the desired result?
The $1 like backreferences can only be used inside string replacement patterns. In your code, "$1" is not a backreference any longer.
You may use a "callback" with replaceAllIn to actually get the match object and access its groups to further manipulate them:
val pattern = "'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'".r
y = pattern replaceAllIn (y, m => "'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(m.group(1)))+"'")
Regex.replaceAllIn is overloaded and can take a Match => String.

scrapy/xpaths/regex: proper xpath/re to ignore "link interjections"

I am scraping some korean language text and I come across a lot of "link interjection" for lack of a better word, where the html looks like this...
는 좋아요
it shows '저' as a hyperlink but the '는 좋아요' as regular text. They are in reality part of the same word object and display on the page as '저는 좋아요요' but when scraping using this xpath and regex...
foo = response.xpath('//*[#id="divID"]/p//text()').re(ur'[\uac00-\ud7af]+')
it is broken into two words in a list...
foo == ['저', '는', '좋아요']
How can I get this to keep it as one word, as was my original intent?
intended: foo == ['좋는', '좋아요']
EDIT: (comment response)
the problem with .join() is that it will join all the regularly scraped words as well as far as I can tell. So I would end up with this...
''.join(foo) == ['좋는좋아요']
So I do not think that .join() will work unless there is something I am missing
If you want to work on the string representation of an HTML element, XPath has a string() function that can be very helpful.
Once you have a single string for the element, you can apply regular expressions for words.
Here's a sample python interpreter session (I had to change your markup a bit to match the results you showed):
>>> import scrapy
>>>
>>> response = scrapy.Selector(text=u'<p>저는 좋아요</p>')
.//text() will select all descendant text nodes, as individual strings when .extract()ed (2 strings in this case):
>>> response.xpath('.//p//text()').extract()
[u'\uc800', u'\ub294 \uc88b\uc544\uc694']
And with the regex, you'll find 1 word, then 2 words:
>>> response.xpath('.//p//text()').re(ur'[\uac00-\ud7af]+')
[u'\uc800', u'\ub294', u'\uc88b\uc544\uc694']
>>> for e in response.xpath('.//p//text()').re(ur'[\uac00-\ud7af]+'):
... print e
...
저
는
좋아요
If you use XPath string() function on the paragraph element, you get a single string, even if the element has other children like a:
>>> response.xpath('string(.//p)').extract()
[u'\uc800\ub294 \uc88b\uc544\uc694']
>>> print response.xpath('string(.//p)').extract_first()
저는 좋아요
And you can then apply your regular expression to split on words:
>>> response.xpath('string(.//p)').re(ur'[\uac00-\ud7af]+')
[u'\uc800\ub294', u'\uc88b\uc544\uc694']
>>> for e in response.xpath('string(.//p)').re(ur'[\uac00-\ud7af]+'):
... print e
...
저는
좋아요
Note that string(node-set) only considers the first element in the node-set you pass as argument, so make sure your XPath expression first matches the element you want, or you can also chain XPath expression with scrapy selectors:
>>> for e in response.xpath('.//p').xpath('string(.)').re(ur'[\uac00-\ud7af]+'):
... print e
...
저는
좋아요

Python: get the string between two capitals

I'd like your opinion as you might be more experienced on Python as I do.
I came from C++ and I'm still not used to the Pythonic way to do things.
I want to loop under a string, between 2 capital letters. For example, I could do that this way:
i = 0
str = "PythonIsFun"
for i, z in enumerate(str):
if(z.isupper()):
small = ''
x = i + 1
while(not str[x].isupper()):
small += str[x]
I wrote this on my phone, so I don't know if this even works but you caught the idea, I presume.
I need you to help me get the best results on this, not just in a non-forced way to the cpu but clean code too. Thank you very much
This is one of those times when regexes are the best bet.
(And don't call a string str, by the way: it shadows the built-in function.)
s = 'PythonIsFun'
result = re.search('[A-Z]([a-z]+)[A-Z]', s)
if result is not None:
print result.groups()[0]
you could use regular expressions:
import re
re.findall ( r'[A-Z]([^A-Z]+)[A-Z]', txt )
outputs ['ython'], and
re.findall ( r'(?=[A-Z]([^A-Z]+)[A-Z])', txt )
outputs ['ython', 's']; and if you just need the first match,
re.search ( r'[A-Z]([^A-Z]+)[A-Z]', txt ).group( 1 )
You can use a list comprehension to do this easily.
>>> s = "PythonIsFun"
>>> u = [i for i,x in enumerate(s) if x.isupper()]
>>> s[u[0]+1:u[1]]
'ython'
If you can't guarantee that there are two upper case characters you can check the length of u to make sure it is at least 2. This does iterate over the entire string, which could be a problem if the two upper case characters occur at the start of a lengthy string.
There are many ways to tackle this, but I'd use regular expressions.
This example will take "PythonIsFun" and return "ythonsun"
import re
text = "PythonIsFun"
pattern = re.compile(r'[a-z]') #look for all lower-case characters
matches = re.findall(pattern, text) #returns a list of lower-chase characters
lower_string = ''.join(matches) #turns the list into a string
print lower_string
outputs:
ythonsun

Take first successful match from a batch of regexes

I'm trying to extract set of data from a string that can match one of three patterns. I have a list of compiled regexes. I want to run through them (in order) and go with the first match.
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = None
for reg in regexes:
m = reg.match(name)
if m: break
if not m:
print 'ARGL NOTHING MATCHES THIS!!!'
This should work (haven't tested yet) but it's pretty fugly. Is there a better way of boiling down a loop that breaks when it succeeds or explodes when it doesn't?
There might be something specific to re that I don't know about that allows you to test multiple patterns too.
You can use the else clause of the for loop:
for reg in regexes:
m = reg.match(name)
if m: break
else:
print 'ARGL NOTHING MATCHES THIS!!!'
If you just want to know if any of the regex match then you could use the builtin any function:
if any(reg.match(name) for reg in regexes):
....
however this will not tell you which regex matched.
Alternatively you can combine multiple patterns into a single regex with |:
regex = re.compile(r"(regex1)|(regex2)|...")
Again this will not tell you which regex matched, but you will have a match object that you can use for further information. For example you can find out which of the regex succeeded from the group that is not None:
>>> match = re.match("(a)|(b)|(c)|(d)", "c")
>>> match.groups()
(None, None, 'c', None)
However this can get complicated however if any of the sub-regex have groups in them as well, since the numbering will be changed.
This is probably faster than matching each regex individually since the regex engine has more scope for optimising the regex.
Since you have a finite set in this case, you could use short ciruit evaluation:
m = compiled_regex_1.match(name) or
compiled_regex_2.match(name) or
compiled_regex_3.match(name) or
print("ARGHHHH!")
In Python 2.6 or better:
import itertools as it
m = next(it.ifilter(None, (r.match(name) for r in regexes)), None)
The ifilter call could be made into a genexp, but only a bit awkwardly, i.e., with the usual trick for name binding in a genexp (aka the "phantom nested for clause idiom"):
m = next((m for r in regexes for m in (r.match(name),) if m), None)
but itertools is generally preferable where applicable.
The bit needing 2.6 is the next built-in, which lets you specify a default value if the iterator is exhausted. If you have to simulate it in 2.5 or earlier,
def next(itr, deft):
try: return itr.next()
except StopIteration: return deft
I use something like Dave Kirby suggested, but add named groups to the regexps, so that I know which one matched.
regexps = {
'first': r'...',
'second': r'...',
}
compiled = re.compile('|'.join('(?P<%s>%s)' % item for item in regexps.iteritems()))
match = compiled.match(my_string)
print match.lastgroup
Eric is in better track in taking bigger picture of what OP is aiming, I would use if else though. I would also think that using print function in or expression is little questionable. +1 for Nathon of correcting OP to use proper else statement.
Then my alternative:
# alternative to any builtin that returns useful result,
# the first considered True value
def first(seq):
for item in seq:
if item: return item
regexes = [
compiled_regex_1,
compiled_regex_2,
compiled_regex_3,
]
m = first(reg.match(name) for reg in regexes)
print(m if m else 'ARGL NOTHING MATCHES THIS!!!')