In Python, how can I convert a regex match type (sre.SRE_Match) to floats? [duplicate] - regex

I am trying to use a regular expression to extract words inside of a pattern.
I have a string that looks like this:
someline abc
someother line
name my_user_name is valid
some more lines
I want to extract the word my_user_name. I do something like
import re
s = #that big string
p = re.compile("name .* is valid", re.flags)
p.match(s) # this gives me <_sre.SRE_Match object at 0x026B6838>
How do I extract my_user_name now?

You need to capture from the regex: search for the pattern and, if found, retrieve the string using group(index). Assuming validity checks are performed:
>>> p = re.compile("name (.*) is valid")
>>> result = p.search(s)
>>> result
<_sre.SRE_Match object at 0x10555e738>
>>> result.group(1) # group(1) will return the 1st capture (stuff within the brackets).
# group(0) will return the entire matched text.
'my_user_name'
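For comparison, group(0) gives the whole matched text:
>>> result.group(0)
'name my_user_name is valid'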

You can use matching groups:
p = re.compile('name (.*) is valid')
e.g.
>>> import re
>>> p = re.compile('name (.*) is valid')
>>> s = """
... someline abc
... someother line
... name my_user_name is valid
... some more lines"""
>>> p.findall(s)
['my_user_name']
Here I use re.findall rather than re.search to get all instances of my_user_name. Using re.search, you'd need to get the data from the group on the match object:
>>> p.search(s) #gives a match object or None if no match is found
<_sre.SRE_Match object at 0xf5c60>
>>> p.search(s).group() #entire string that matched
'name my_user_name is valid'
>>> p.search(s).group(1) #first group that matched within the matched string
'my_user_name'
As mentioned in the comments, you might want to make your regex non-greedy:
p = re.compile('name (.*?) is valid')
to only pick up the stuff between 'name ' and the next ' is valid' (rather than allowing your regex to pick up other ' is valid' strings in your group).
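For instance, on a string that happens to contain ' is valid' twice (a quick sketch to show the difference; s2 is a made-up example):
import re

s2 = "name my_user_name is valid and name other_name is valid"
print(re.search('name (.*) is valid', s2).group(1))   # 'my_user_name is valid and name other_name' (greedy)
print(re.search('name (.*?) is valid', s2).group(1))  # 'my_user_name' (non-greedy)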

You could use something like this:
import re
s = #that big string
# the parentheses create a group with what was matched
# and '\w' matches word characters (letters, digits and underscore)
p = re.compile(r"name +(\w+) +is valid")
# use search(), so the match doesn't have to happen
# at the beginning of "big string"
m = p.search(s)
# search() returns a Match object with information about what was matched
if m:
    name = m.group(1)
else:
    raise Exception('name not found')

You can use groups (indicated with '(' and ')') to capture parts of the string. The match object's group() method then gives you the group's contents:
>>> import re
>>> s = 'name my_user_name is valid'
>>> match = re.search('name (.*) is valid', s)
>>> match.group(0) # the entire match
'name my_user_name is valid'
>>> match.group(1) # the first parenthesized subgroup
'my_user_name'
In Python 3.6+ you can also index into a match object instead of using group():
>>> match[0] # the entire match
'name my_user_name is valid'
>>> match[1] # the first parenthesized subgroup
'my_user_name'

Maybe that's a bit shorter and easier to understand:
import re
text = '... someline abc... someother line... name my_user_name is valid.. some more lines'
>>> re.search('name (.*) is valid', text).group(1)
'my_user_name'

You want a capture group.
p = re.compile("name (.*) is valid") # parentheses for capture groups
print p.search(s).groups() # This gives you a tuple of your matches.

Here's a way to do it without using groups (Python 3.6 or above):
>>> re.search(r'2\d\d\d[01]\d[0-3]\d', 'report_20191207.xml')[0]
'20191207'

You can also use a capture group (?P<user>pattern) and access the group like a dictionary match['user'].
import re

string = '''someline abc\n
someother line\n
name my_user_name is valid\n
some more lines\n'''
pattern = r'name (?P<user>.*) is valid'
matches = re.search(pattern, string, re.DOTALL)
print(matches['user'])
# my_user_name
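As a small follow-up sketch (not part of the original answer): groupdict() hands back all named groups at once as a dict:
print(matches.groupdict())
# {'user': 'my_user_name'}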

I found this answer via Google because I wanted to unpack a re.search() result with multiple groups directly into multiple variables. While this might be obvious to some, it was not to me, because in the past I had always used group(); maybe it helps someone in the future who also did not know about groups().
s = "2020:12:30"
year, month, day = re.search(r"(\d+):(\d+):(\d+)", s).groups()
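One caveat (a hedged sketch, since the snippet above assumes the pattern always matches): re.search() returns None on no match, so unpacking right away raises AttributeError; guarding first avoids that:
m = re.search(r"(\d+):(\d+):(\d+)", s)
if m:
    year, month, day = m.groups()
else:
    year = month = day = None  # no match; handle however suits your code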

It seems like you're actually trying to extract a name rather than simply find a match. If that is the case, having span indexes for your match is helpful, and I'd recommend using re.finditer. As a shortcut, you know the 'name ' part of your regex is length 5 and the ' is valid' part is length 9, so you can slice the matching text to extract the name.
Note - In your example, it looks like s is a string with line breaks, so that's what's assumed below.
## convert s to a list of strings separated by line:
s2 = s.splitlines()
## find matches by line:
for i, j in enumerate(s2):
    matches = re.finditer("name (.*) is valid", j)
    ## lines without a match yield an empty iterator, so the inner loop is simply skipped
    ## loop through match group elements
    for k in matches:
        ## get text
        match_txt = k.group(0)
        ## get line span
        match_span = k.span(0)
        ## extract username by slicing off 'name ' (5 chars) and ' is valid' (9 chars)
        my_user_name = match_txt[5:-9]
        ## compare with original text
        print(f'Extracted Username: {my_user_name} - found on line {i}')
        print('Match Text:', match_txt)

Related

I want to match two URLs in Python using regular expressions, but the following code gives an error, and I cannot figure out why.

input = 'susaya https://sousfs#sls.sus.uk/de/sekd/sho/project1/first_project'
url_match = re.match("\s*susaya\s+([^ ]+)", input)
When I try to print url_match, I get the memory location.
print url_match
None
<_sre.SRE_Match object at 0x5f630cs48e40>
What is the output of the regular expression "\s*susaya\s+([^ ]+)"?
Do I get None when I try to print it because url_match doesn't match?
I am using python2.7. Thanks.
re.match doesn't return a string, it returns a match object. Calling group(i) on the match object returns the i'th capture group, with the 0th capture group being the entire match.
>>> input = "susaya https://sousfs#sls.sus.uk/de/sekd/sho/project1/first_project"
>>> url_match = re.match(r"\s*susaya\s+([^ ]+)", input)
>>> url_match.group(0)
'susaya https://sousfs#sls.sus.uk/de/sekd/sho/project1/first_project'
The pattern "\s*susaya\s+([^ ]+)" matches zero or more spaces, followed by "susaya", followed by one or more spaces, followed by a capture group of one or more characters that are not spaces.
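And group(1) holds the captured URL from that same match:
>>> url_match.group(1)
'https://sousfs#sls.sus.uk/de/sekd/sho/project1/first_project'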

Regular expression to get a group of text

I can't find the right regular expression:
print(re.compile(r'row_([0-9]+)(_[^_]+)*').split('row_0007_id_testa_testb'))
> ['', '0007', '_testb', '']
I tried with a non-greedy regexp; that didn't work either:
print(re.compile(r'row_([0-9]+)(_[^_]+)+?').split('row_0007_id_testa_testb'))
['', '0007', '_id', '_testa_testb']
I need to get this:
> ['', '0007', 'id', 'testa', 'testb']
You can use a simple regex _([^_]+) in findall with an inline if condition to assert that the string starts with row_:
>>> reg = re.compile(r'_([^_]+)')
>>> s = 'row_0007_id_testa_testb'
>>> print re.findall(reg, s) if s.startswith('row_') else None
['0007', 'id', 'testa', 'testb']
>>> s = 'col_0007_id_testa_testb'
>>> print re.findall(reg, s) if s.startswith('row_') else None
None
You didn't indicate the host programming language, but I assume it is Python.
As I see it, you want to split the source text on the _ character, so just _ should be the content of the regex.
p = re.compile('_').split('row_0007_id_testa_testb')
sets p to ['row', '0007', 'id', 'testa', 'testb'].
So the only thing left to change is to set the starting element to an empty string. Then you can print the p list, getting the expected result.
Below you have the example script:
import re
p = re.compile('_').split('row_0007_id_testa_testb')
print(p)
p[0] = ''
print(p)
Assuming you want the resulting items to contain letters and numbers only, with the first match replaced by an empty string, use
re.compile(r'^[^\W_]+_+|[\W_]+').split('row_0007_id_testa_testb')
Output:
['', '0007', 'id', 'testa', 'testb']
Maybe try using a single re.split command:
re.split(r"^row_|_", "row_0007_id_testa_testb")
For a more generalized replace:
re.split(r"^[a-z]+_|_", "row_0007_id_testa_testb")
This eliminates the leading word characters together with the first _ and then splits the rest.
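A quick sketch running both split variants (both give the same result for this input):
import re

print(re.split(r"^row_|_", "row_0007_id_testa_testb"))
# ['', '0007', 'id', 'testa', 'testb']
print(re.split(r"^[a-z]+_|_", "row_0007_id_testa_testb"))
# ['', '0007', 'id', 'testa', 'testb']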
It's unclear from the question, but if you're looking to prepend an empty element to the split list, this should work:
r = re.split(r"_", s)[1:]
r.insert(0,'')
You could use an alternation | with findall, which would match row_ and capture anything that is not an underscore in a capturing group ([^_]+):
row_|([^_]+)
import re
print(re.findall(r"row_|([^_]+)", 'row_0007_id_testa_testb'))
That would give you:
['', '0007', 'id', 'testa', 'testb']

python3: regex needs the character to match but I don't want it in the output

I have this string:
Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/
I am trying to extract 839518730.47873.0000 from it. For the exact string my regex is fine, but if I include any digit before the first =, then it all goes wrong.
No Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'839518730.47873.0000'
With Digit
>>> m=re.search('[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group()
'2'
Is there any way I can extract '839518730.47873.0000' only, no matter what else lies in the string?
I tried
>>> m=re.search('=[0-9.]+','Set-Cookie: BIGipServerApp_Pool_SSL=839518730.47873.0000; path=/')
>>> m.group()
'=839518730.47873.0000'
That works as well, but it starts with '=' in the output and I don't want that.
Any ideas?
Thank you.
If your substring always comes after the first =, you can just use a capture group with the =([\d.]+) pattern:
import re
result = ""
m = re.search(r'=([0-9.]+)','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
if m:
    result = m.group(1) # Get Group 1 value only
print(result)
The main point is that you match anything you do not need, and match and capture (with the unescaped round brackets) the part of the pattern you do need. The value you need is in Group 1.
You can use word boundaries:
\b[\d.]+
Or, to make the match more targeted, use a lookahead for the next semicolon after your matched text:
\b[\d.]+(?=\s*;)
Update:
>>> m=re.search(r'\b[\d.]+(?=\s*;)','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group(0)
'839518730.47873.0000'
>>> m=re.search(r'\b[\d.]+','Set-Cookie: BIGipServerApp_Pool_SSL2=839518730.47873.0000; path=/')
>>> m.group(0)
'839518730.47873.0000'
>>>

RegEx pattern returning all words except those in braces

I have a text of the form:
können {konnte, gekonnt} Verb
And I want to get a match for all words in it that are not inside the braces. That means:
können = 1st match, Verb = 2nd match
Unfortunately I still haven't gotten the knack of regular expressions. There are plenty of possibilities for testing them, but not much help for creating them unless you want to read a book.
I will use them in Java or Python.
In Python you could do this:
import re
regex = re.compile(r'(?:\{.*?\})?([^{}]+)', re.UNICODE)
print 'Matches: %r' % regex.findall(u'können {konnte, gekonnt} Verb')
Result:
Matches: [u'können ', u' Verb']
Although I would recommend simply replacing everything between { and } like so:
import re
regex = re.compile(r'\{.*?\}', re.UNICODE)
print 'Output string: %r' % regex.sub('', u'können {konnte, gekonnt} Verb')
Result:
Output string: u'können Verb'
A regex SPLIT using this pattern will do the job:
(\s+|\s*{[^}]*\}\s*)
and ignore any empty value.
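In Python that could look roughly like this (a sketch: the group is made non-capturing, since re.split returns captured separators, and empty values are dropped as suggested):
import re

text = 'können {konnte, gekonnt} Verb'
parts = re.split(r'\s+|\s*\{[^}]*\}\s*', text)
print([p for p in parts if p])
# ['können', 'Verb']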

String separation in required format, Pythonic way? (with or w/o Regex)

I have a string in the format:
t='#abc #def Hello this part is text'
I want to get this:
l=["abc", "def"]
s='Hello this part is text'
I did this:
a=t[t.find(' ',t.rfind('#')):].strip()
s=t[:t.find(' ',t.rfind('#'))].strip()
b=a.split('#')
l=[i.strip() for i in b][1:]
It works for the most part, but it fails when the text part contains a '#'.
Eg, when:
t='#abc #def My email is red#hjk.com'
it fails. The #names are there in the beginning and there can be text after #names, which may possibly contain #.
Clearly I can append initally with a space and find out first word without '#'. But that doesn't seem an elegant solution.
What is a pythonic way of solving this?
Building unashamedly on MrTopf's effort:
import re
rx = re.compile(r"((?:#\w+ +)+)(.*)")
t='#abc #def #xyz Hello this part is text and my email is foo#ba.r'
a,s = rx.match(t).groups()
l = re.split('[# ]+',a)[1:-1]
print l
print s
prints:
['abc', 'def', 'xyz']
Hello this part is text and my email is foo#ba.r
Justly called to account by hasen j, let me clarify how this works:
/#\w+ +/
matches a single tag - # followed by at least one alphanumeric or _ followed by at least one space character. + is greedy, so if there is more than one space, it will grab them all.
To match any number of these tags, we need to add a plus (one or more things) to the pattern for tag; so we need to group it with parentheses:
/(#\w+ +)+/
which matches one-or-more tags, and, being greedy, matches all of them. However, those parentheses now fiddle around with our capture groups, so we undo that by making them into an anonymous group:
/(?:#\w+ +)+/
Finally, we make that into a capture group and add another to sweep up the rest:
/((?:#\w+ +)+)(.*)/
A last breakdown to sum up:
((?:#\w+ +)+)(.*)
(?:#\w+ +)+
( #\w+ +)
#\w+ +
Note that in reviewing this, I've improved it - \w didn't need to be in a set, and it now allows for multiple spaces between tags. Thanks, hasen-j!
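A quick sketch of the multiple-spaces case the revised pattern now handles (t2 here is a made-up input):
import re

rx = re.compile(r"((?:#\w+ +)+)(.*)")
t2 = '#abc   #def  Hello this part is text'
a, s2 = rx.match(t2).groups()
print(re.split('[# ]+', a)[1:-1])  # ['abc', 'def']
print(s2)                          # Hello this part is text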
t='#abc #def Hello this part is text'
words = t.split(' ')
names = []
while words:
    w = words.pop(0)
    if w.startswith('#'):
        names.append(w[1:])
    else:
        words.insert(0, w)  # put the first non-tag word back before stopping
        break
text = ' '.join(words)
print names
print text
How about this:
1. Split by space.
2. For each word, check:
   2.1. if the word starts with #, push it to the first list;
   2.2. otherwise, just join the remaining words by spaces.
You might also use regular expressions:
import re
rx = re.compile(r"#([\w]+) #([\w]+) (.*)")
t='#abc #def Hello this part is text and my email is foo#ba.r'
a,b,s = rx.match(t).groups()
But this all depends on what your data can look like, so you might need to adjust it. What it does is basically create groups via () and check what's allowed in them.
[i.strip('#') for i in t.split(' ', 2)[:2]] # for a fixed number of #def
a = [i.strip('#') for i in t.split(' ') if i.startswith('#')]
s = ' '.join(i for i in t.split(' ') if not i.startswith('#'))
[edit: this is implementing what was suggested by Osama above]
This will create L based on the # words from the beginning of the string, and then once a non-# word is found, just grab the rest of the string.
t = '#one #two #three some text afterward with # symbols# meow#meow'
words = t.split(' ') # split into list of words based on spaces
L = []
s = ''
for i in range(len(words)): # go through each word
    word = words[i]
    if word[0] == '#': # grab #'s from beginning of string
        L.append(word[1:])
        continue
    s = ' '.join(words[i:]) # put spaces back in
    break # you can ignore the rest of the words
You can refactor this to be less code, but I'm trying to make what is going on obvious.
Here's just another variation that uses split() and no regexes:
t='#abc #def My email is red#hjk.com'
tags = []
words = iter(t.split())
# iterate over words until first non-tag word
for w in words:
    if not w.startswith("#"):
        # join this word and all the following
        s = w + " " + (" ".join(words))
        break
    tags.append(w[1:])
else:
    s = "" # handle string with only tags
print tags, s
Here's a shorter but perhaps a bit cryptic version that uses a regexp to find the first space followed by a non-# character:
import re
t = '#abc #def My email is red#hjk.com #extra bye'
m = re.search(r"\s([^#].*)$", t)
tags = [tag[1:] for tag in t[:m.start()].split()]
s = m.group(1)
print tags, s # ['abc', 'def'] My email is red#hjk.com #extra bye
This doesn't work properly if there are no tags or no text. The format is underspecified. You'll need to provide more test cases to validate.