Regex and capturing groups - regex

I have my regex working to the point where it produces two groups of text, group 1 and group 2, but I'm only really interested in the group 2 text. How do I get just group 2 to match/display?
("token":")([^,]+)

Something like the following should work:
>>> import re
>>> p = re.compile(r'("token":")([^,]+)')
>>> m = p.search('...')  # '...' stands for your input; search() scans the whole string, match() only tries the start
>>> m.group(2)
This returns the content of the second group.
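As a minimal sketch of the same idea on a made-up input (the sample string and the names `text`, `p2` and `p3` are assumptions, not taken from the question): `group(2)` picks out the second group, and turning the prefix into a non-capturing group or a lookbehind is another way to keep only the wanted text.

```python
import re

# Hypothetical input; the real text from the question is not shown.
text = '{"token":"abc123","id":7}'

# Original pattern: two capturing groups, the second is the one of interest.
p = re.compile(r'("token":")([^,]+)')
second = p.search(text).group(2)

# Variant: make the prefix non-capturing, so the wanted text becomes group 1.
p2 = re.compile(r'(?:"token":")([^,]+)')
only_group = p2.search(text).group(1)

# Variant: a lookbehind keeps the prefix out of the match entirely.
p3 = re.compile(r'(?<="token":")[^,]+')
whole_match = p3.search(text).group(0)

# Note: [^,]+ runs up to the next comma, so the closing quote is included.
print(second, only_group, whole_match)
```

All three variants yield the same text here; which one reads best depends on whether the prefix is needed elsewhere in the match.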

Related

Different output for pd.str.extract() and re.search()

As seen in my previous question
Rename columns regex, keep name if no match
Why is there a different output of the regex?
import re
import pandas as pd
data = {'First_Column': [1,2,3], 'Second_Column': [1,2,3],
        r'\First\Mid\LAST.Ending': [1,2,3], r'First1\Mid1\LAST1.Ending': [1,2,3]}
df = pd.DataFrame(data)
First_Column Second_Column \First\Mid\LAST.Ending First1\Mid1\LAST1.Ending
pd.str.extract()
df.columns.str.extract(r'([^\\]+)\.Ending')
0
0 NaN
1 NaN
2 LAST
3 LAST1
re.search()
col = df.columns.tolist()
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group())
LAST.Ending
LAST1.Ending
THX
From pandas.Series.str.extract docs
Extract capture groups in the regex pat as columns in a DataFrame.
So str.extract returns the capture group. re.search followed by group() or group(0) returns the whole match instead; change it to group(1) and it returns capture group 1.
This returns the full match:
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group())
LAST.Ending
LAST1.Ending
This returns only the capture group:
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group(1))
LAST
LAST1
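The difference can be sketched with `re` alone. The helper name `whole_and_group` is made up for illustration; it returns both views of one match, and `None` values where the pattern does not match (the case where `str.extract` shows NaN).

```python
import re

PATTERN = re.compile(r"([^\\]+)\.Ending")


def whole_and_group(name):
    """Return (whole match, capture group 1), or (None, None) without a match."""
    m = PATTERN.search(name)
    if m is None:
        return None, None
    return m.group(0), m.group(1)


for column in ["First_Column", r"\First\Mid\LAST.Ending", r"First1\Mid1\LAST1.Ending"]:
    print(column, "->", whole_and_group(column))
```

Checking for `None` before calling `group()` also avoids the `AttributeError` that the loops above would raise on columns that do not match.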

Regex currency Python 3.5

I am trying to reformat the Euro currency in text data. The original format is like this: EUR 3.000.00 or also EUR 33.540.000.- .
I want to standardise the format to €3000.00 or €33540000.00.
I have reformatted EUR 2.500.- successfully using this code:
import re
format1 = "a piece of text with currency EUR 2.500.- and some other information"
regexObj = re.compile(r'EUR\s*\d{1,3}[.](\d{3}[.]-)')
text1 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(\.\d+)?').search(m.group().replace('.','')).group())),format1)
Out: "a piece of text with currency €2500.00 and some other information"
This gives me €2500.00 which is correct. I've tried applying the same logic to the other formats to no avail.
format2 = "another piece of text EUR 3.000.00 and EUR 5.000.00. New sentence"
regexObj = re.compile('\d{1,3}[.](\d{3}[.])(\d{2})?')
text2 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(\.\d+)?').search(m.group().replace('.','')).group())),format2)
Out: "another piece of text EUR €300000.00 and EUR €500000.00. New sentence"
and
format3 = "another piece of text EUR 33.540.000.- and more text"
regexObj = re.compile(r'EUR\s*\d{1,3}[.](\d{3}[.])(\d{3}[.])(\d{3}[.]-)')
text3 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(.\d+)?').search(m.group().replace('.','')).group())),format3)
Out: "another piece of text EUR 33.540.000.- and more text"
I think the problem might be with the regexObj.sub(), as the .format() part of it is confusing me. I've tried to change re.compile('\d+(.\d+)?(.\d+)?') within that, but I can't seem to generate the result I want. Any ideas much appreciated. Thanks!
Let's start with the regex. My proposal is:
EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))
Details:
EUR\s* - The starting part.
(?: - Start of a non-capturing group - a container for the alternatives.
  ( - Start of capturing group #1 (integer part followed by ".-" instead of the decimal part).
    \d{1,3} - 1 to 3 digits.
    (?:\.\d{3})* - ".ddd" part, 0 or more times.
  ) - End of group #1.
  \.- - ".-" ending.
| - Alternative separator.
  ( - Start of capturing group #2 (integer part).
    \d{1,3}(?:\.\d{3})* - Same as in alternative 1.
  ) - End of group #2.
  (\.\d{2}) - Capturing group #3 (dot and decimal part).
) - End of the non-capturing group.
Instead of a lambda function I used an ordinary replacement function, which I called repl. It has two branches: one for group 1 and one for groups 2 + 3.
In both branches the dots are removed from the integer part, but the final dot (after the integer part) belongs to group 3, so it is kept.
The whole script can look like this:
import re

def repl(m):
    g1 = m.group(1)
    if g1:  # Alternative 1: ".-" instead of decimal part
        res = g1.replace('.', '') + '.00'
    else:   # Alternative 2: integer part (group 2) + decimal part (group 3)
        res = m.group(2).replace('.', '') + m.group(3)
    return "\u20ac" + res

# Source string
src = 'xxx EUR 33.540.000.- yyyy EUR 3.000.00 zzzz EUR 5.210.15 vvvv'
# Regex
pat = re.compile(r'EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))')
# Replace
result = pat.sub(repl, src)
The result is:
xxx €33540000.00 yyyy €3000.00 zzzz €5210.15 vvvv
As you can see, no need to use float or format.
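As a quick check, the same pattern and replacement function can be run against the three strings from the question. This is only a sketch: the wrapper name `normalise` is an assumption added here for convenience.

```python
import re

PAT = re.compile(r'EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))')


def repl(m):
    """Build the normalised amount from whichever alternative matched."""
    if m.group(1):  # ".-" form: no decimals given, so append ".00"
        amount = m.group(1).replace('.', '') + '.00'
    else:           # integer part (group 2) plus ".dd" (group 3)
        amount = m.group(2).replace('.', '') + m.group(3)
    return "\u20ac" + amount


def normalise(text):
    """Hypothetical convenience wrapper around PAT.sub."""
    return PAT.sub(repl, text)


format1 = "a piece of text with currency EUR 2.500.- and some other information"
format2 = "another piece of text EUR 3.000.00 and EUR 5.000.00. New sentence"
format3 = "another piece of text EUR 33.540.000.- and more text"

for sample in (format1, format2, format3):
    print(normalise(sample))
```

Note that in the second string the full stop ending the sentence is left in place, because `\.\d{2}` only consumes a dot that is followed by two digits.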

Extract filename and id from its name

I have a file with text
# co2a0000123.rd
# co2c0000124.rd
I need a regex that extracts co2a0000123 into group 1 and the a or c that follows co2 into group 2.
I have tried
(\B[a|c])([a-z0-9]+).(?:[a-z]+)
On its own, ([a-z0-9]+).(?:[a-z]+) gives co2a0000123 in group 1 as desired. But as soon as I add (\B[a|c]) at the beginning or end, group 1 changes to co2a and group 2 gives 'a'.
Try for example \s(\w+?([ac])\w*)\.
Group 1 will be the part between a space and a dot.
Group 2 will be the first a or c anywhere except the first letter within Group 1.
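A small sketch of that pattern applied to the two lines from the question (the variable names are made up for illustration):

```python
import re

# Group 1: the name between a whitespace character and the dot.
# Group 2: the first a or c after the first character of that name.
FILE_ID = re.compile(r"\s(\w+?([ac])\w*)\.")

text = "# co2a0000123.rd\n# co2c0000124.rd"

pairs = [(m.group(1), m.group(2)) for m in FILE_ID.finditer(text)]
print(pairs)
```

The lazy `\w+?` has to consume at least one character, which is why the leading `c` of `co2...` is never picked up as group 2.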

[Python3]RegEx to match multiple strings

I am trying to match multiple strings with a regex that also includes an optional capture group.
My RegEx:
(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])
Strings:
[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]
[TEst]_ANother - Company Of 2[Hi-Res]
[Yes]coOl__ - some text_[20.06.96][Hi-Res]
How can I match all of these and optimize my RegEx? I'm still new to this.
I assume this is what you want (apply it to each string, or use re.findall / re.finditer to get every match):
r"\[(.*?)\](.*?)\s*-\s*(.*?)(?:\[(\d{2}\.\d{2}\.\d{2})\])?\[(.*?)\]"
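A minimal sketch of this pattern run over the three strings from the question (the names `PATTERN`, `titles` and `rows` are assumptions). The optional date group comes back as `None` when the string has no date.

```python
import re

PATTERN = re.compile(
    r"\[(.*?)\](.*?)\s*-\s*(.*?)(?:\[(\d{2}\.\d{2}\.\d{2})\])?\[(.*?)\]"
)

titles = [
    "[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]",
    "[TEst]_ANother - Company Of 2[Hi-Res]",
    "[Yes]coOl__ - some text_[20.06.96][Hi-Res]",
]

# One tuple per string: (tag, artist, title, date or None, quality)
rows = [PATTERN.search(t).groups() for t in titles]
for row in rows:
    print(row)
```

The lazy `(.*?)` for the title is what lets the optional date group claim `[20.06.96]`; with a greedy `(.*)` the date would be swallowed by the title group.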
Consider approaching this with pandas as shown below:
import pandas as pd
# create a Series object containing the strings to be searched
s = pd.Series([
    '[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]',
    '[TEst]_ANother - Company Of 2[Hi-Res]',
    '[Yes]coOl__ - some text_[20.06.96][Hi-Res]'
])
# use pandas' StringMethods to perform regex extraction; a DataFrame object is returned because your regex contains more than one capture group
s.str.extract(r'(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])', expand=True)
# returns the following
        0              1                            2    3         4
0  [Test]  Kyubiikitsune  Company Of Wolves[20.06.96]  NaN  [Hi-Res]
1  [TEst]       _ANother                 Company Of 2  NaN  [Hi-Res]
2   [Yes]         coOl__         some text_[20.06.96]  NaN  [Hi-Res]

Convert a regex expression to erlang's re syntax?

I am having a hard time converting the following regular expression into Erlang's re syntax.
What I have is a test string like this:
1,2 ==> 3 #SUP: 1 #CONF: 1.0
And the regex that I created with regex101 is this (see below):
([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)
But I am getting weird match results if I convert it to erlang - here is my attempt:
{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
Also, I get more than four matches. What am I doing wrong?
Here is the regex101 version:
https://regex101.com/r/xJ9fP2/1
I don't know much about erlang, but I will try to explain. With your regex
>{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
>re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
{match,[{0, 28},{0,3},{8,1},{16,1},{25,3}]}
         ^  ^^
         |  ||
         |  Total number of matched characters from starting index
         Starting index of match
Reason for more than four groups
The first entry always describes the part of the string matched by the complete regex; the rest are the four captured groups you want. So there are five entries in total.
([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)
<------->         <---->             <--->              <--------->
First group       Second group       Third group        Fourth group
<----------------------------------------------------------------->
The whole regex matches the entire string; this is the first entry you are getting
(zeroth group)
How to find desired answer
Here we want everything except the first entry (the entire match). We can use all_but_first to skip it:
> re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M, [{capture, all_but_first, list}]).
{match,["1,2","3","1","1.0"]}
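For comparison only, here is a sketch of the same numbering in Python's `re` module (this is Python, not Erlang, and the names `pairs` and `captures` are made up). Group 0 is the whole match and groups 1 to 4 are the captures; Python reports (start, end) spans, so they are converted to the (start, length) form that `re:run` prints.

```python
import re

pattern = re.compile(r"([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)")
m = pattern.search("1,2 ==> 3 #SUP: 1 #CONF: 1.0")

# Group 0 plus the four capture groups, as (start, length) pairs.
pairs = [(m.start(i), m.end(i) - m.start(i)) for i in range(5)]

# Only the captured text, skipping group 0 (similar in spirit to all_but_first).
captures = list(m.groups())

print(pairs)
print(captures)
```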
If you are in doubt about what the string actually contains, you can print it and check:
1> RE = "([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)".
"([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)"
2> io:format("RE: /~s/~n", [RE]).
RE: /([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)/
For the rest of the issue, there is a great answer by rock321987.