As seen in my previous question
Rename columns regex, keep name if no match
Why does the same regex produce different output in the two cases below?
import pandas as pd

data = {'First_Column': [1, 2, 3], 'Second_Column': [1, 2, 3],
        r'\First\Mid\LAST.Ending': [1, 2, 3], r'First1\Mid1\LAST1.Ending': [1, 2, 3]}
df = pd.DataFrame(data)
First_Column Second_Column \First\Mid\LAST.Ending First1\Mid1\LAST1.Ending
pandas .str.extract()
df.columns.str.extract(r'([^\\]+)\.Ending')
0
0 NaN
1 NaN
2 LAST
3 LAST1
re.search()
import re

col = df.columns.tolist()
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group())
LAST.Ending
LAST1.Ending
Thanks!
From pandas.Series.str.extract docs
Extract capture groups in the regex pat as columns in a DataFrame.
str.extract returns the capture group, whereas re.search with .group() or .group(0) returns the whole match; if you change it to .group(1), it returns capture group 1.
This will return full match:
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group())
LAST.Ending
LAST1.Ending
This will return only the capture group:
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group(1))
LAST
LAST1
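If, as in the linked question, the goal is to keep the original column name wherever the pattern does not match, a minimal sketch (assuming pandas; the fillna-on-the-extract approach is just one option) could look like this:
import pandas as pd

data = {'First_Column': [1, 2, 3], 'Second_Column': [1, 2, 3],
        r'\First\Mid\LAST.Ending': [1, 2, 3], r'First1\Mid1\LAST1.Ending': [1, 2, 3]}
df = pd.DataFrame(data)

# Extract the capture group per column; where there is no match the result is NaN,
# so fall back to the original column name.
extracted = df.columns.str.extract(r'([^\\]+)\.Ending')[0]
df.columns = extracted.fillna(pd.Series(df.columns)).tolist()
print(df.columns.tolist())
# ['First_Column', 'Second_Column', 'LAST', 'LAST1']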
Further reading: Link
Related
So I have converted a PDF to a dataframe and am almost in the final stages of getting it into the format I want. However, I am stuck on the following step. I have a column which looks like this -
Column A
1234[321]
321[3]
123
456[456]
and want to separate it into two different columns B and C such that -
Column B Column C
1234 321
321 3
123 0
456 456
How can this be achieved? I did try something along the lines of
df.Column A.str.strip(r"\[\d+\]")
but I have not been able to get it to work after trying different variations. Any help will be greatly appreciated as this is the final part of this task. Much thanks in advance!
An alternative could be:
# Create the new two columns
df[["Column B", "Column C"]]=df["Column A"].str.split('[', expand=True)
# Get rid of the extra bracket
df["Column C"] = df["Column C"].str.replace("]", "")
# Get rid of the NaN and the useless column
df = df.fillna(0).drop("Column A", axis=1)
# Convert all columns to numeric
df = df.apply(pd.to_numeric)
You may use
import pandas as pd
df = pd.DataFrame({'Column A': ['1234[321]', '321[3]', '123', '456[456]']})
df[['Column B', 'Column C']] = df['Column A'].str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# If you need to drop Column A here, use
# df[['Column B', 'Column C']] = df.pop('Column A').str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
df['Column C'] = df['Column C'].fillna(0)  # avoids chained assignment
df
# Column A Column B Column C
# 0 1234[321] 1234 321
# 1 321[3] 321 3
# 2 123 123 0
# 3 456[456] 456 456
See the regex demo. It matches
^ - start of string
(\d+) - Group 1: one or more digits
(?:\[(\d+)])? - an optional non-capturing group matching [, then capturing into Group 2 one or more digits, and then a ]
$ - end of string.
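To see why the optional group makes the fillna/0 step necessary, here is a quick check of the same pattern with plain re, using the sample values from the question (a small sketch for illustration):
import re

pattern = re.compile(r'^(\d+)(?:\[(\d+)])?$')
for value in ['1234[321]', '321[3]', '123', '456[456]']:
    m = pattern.match(value)
    # Group 2 is None when the optional [...] part is absent, which is why Column C is filled with 0.
    print(value, m.group(1), m.group(2) or '0')
# 1234[321] 1234 321
# 321[3] 321 3
# 123 123 0
# 456[456] 456 456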
I have several text files containing error values. The values are different in each file, so I'm not able to get the exact line where the value is present.
The example is as follows:
v1 = 1111
v2 = A:10 B:2
Text:
12.10.08,11:12:39,183769 1111,10352,003,12,11:12:39,183 Syntax-->12345 (would like to capture v1)
01.01.02,06:10:56,243648 00488,00000,018,01,06:10:56,243 A:10 B:2--1212 (would like to capture v2)
The regex is as follows:
((\d{2}[.]\d{2}[.]\d{2}),(\d{2}[:]\d{2}[:]\d{2},\d*\s*(('+v1+')[,].*|\S*\s('+v2+')).*))
Irrespective of which value is passed, the regex should go through the text and grab it: if v1 is present it should return the complete matching text, and likewise if v2 is present. But it has to be a single regex.
You might use:
\d{2}\.\d{2}\.\d{2},\d{2}:\d{2}:\d{2},\d{6}(?: \d{5}(?:,\d+)+:\d{2}:\d{2},\d+)? (\d{4}\b|[A-Z]:\d{2} [A-Z]:\d)
Explanation
\d{2}\.\d{2}\.\d{2},\d{2}:\d{2}:\d{2},\d{6} Match the format of the starting digits
(?: \d{5}(?:,\d+)+:\d{2}:\d{2},\d+)? Optionally match the part starting with 5 digits up until a time-like format
( Capturing group
\d{4}\b Match 4 digits
| Or
[A-Z]:\d{2} [A-Z]:\d Match A:10 B: format
) Close group
Regex demo
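A quick way to test this against the two sample lines is with Python's re (a minimal sketch; the question does not state the language, so this is only for illustration):
import re

pattern = re.compile(
    r'\d{2}\.\d{2}\.\d{2},\d{2}:\d{2}:\d{2},\d{6}'
    r'(?: \d{5}(?:,\d+)+:\d{2}:\d{2},\d+)? '
    r'(\d{4}\b|[A-Z]:\d{2} [A-Z]:\d)'
)
lines = [
    '12.10.08,11:12:39,183769 1111,10352,003,12,11:12:39,183 Syntax-->12345',
    '01.01.02,06:10:56,243648 00488,00000,018,01,06:10:56,243 A:10 B:2--1212',
]
for line in lines:
    m = pattern.search(line)
    if m:
        print(m.group(0))  # the complete matched text
        print(m.group(1))  # the captured value: 1111 for v1, A:10 B:2 for v2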
I am having a hard time trying to convert the following regular expression into Erlang syntax.
What I have is a test string like this:
1,2 ==> 3 #SUP: 1 #CONF: 1.0
And the regex that I created with regex101 is this (see below):
([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)
But I am getting weird match results when I convert it to Erlang - here is my attempt:
{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
Also, I get more than four matches. What am I doing wrong?
Here is the regex101 version:
https://regex101.com/r/xJ9fP2/1
I don't know much about Erlang, but I will try to explain. With your regex:
>{ok, M} = re:compile("([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)").
>re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M).
{match,[{0,28},{0,3},{8,1},{16,1},{25,3}]}
         ^ ^^
         | |
         | +-- number of matched characters from the starting index
         +---- starting index of the match
Reason for more than four groups
The first element always indicates the entire substring matched by the complete regex; the rest are the four capture groups you want, so there are five groups in total.
([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)
Group 1: ([\\d,]+)
Group 2: (\\d+)
Group 3: (\\d)
Group 4: (\\d+.\\d+)
Group 0: the whole regex - it matches the entire string and is the first match you are getting.
How to get the desired answer
Here we want everything except the first group (which is the entire match). So we can use all_but_first in the capture option to skip it:
> re:run("1,2 ==> 3 #SUP: 1 #CONF: 1.0", M, [{capture, all_but_first, list}]).
{match,["1,2","3","1","1.0"]}
More info can be found here
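For comparison only, the same distinction between the whole match and the capture groups can be seen with Python's re (a sketch; Erlang's {Start,Length} tuples correspond to the spans printed below):
import re

# Group 0 is the whole match; groups 1-4 are the captures, exactly as in the Erlang output above.
pattern = re.compile(r'([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)')
m = pattern.search("1,2 ==> 3 #SUP: 1 #CONF: 1.0")
print(m.group(0))                     # 1,2 ==> 3 #SUP: 1 #CONF: 1.0  (whole match)
print(m.groups())                     # ('1,2', '3', '1', '1.0')      (the four capture groups)
print([m.span(i) for i in range(5)])  # (start, end) pairs; Erlang reports {start, length} instead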
If you are in doubt what is content of the string, you can print it and check out:
1> RE = "([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)".
"([\\d,]+).*==>\\s*(\\d+)\\s*#SUP:\\s*(\\d)\\s*#CONF:\\s*(\\d+.\\d+)"
2> io:format("RE: /~s/~n", [RE]).
RE: /([\d,]+).*==>\s*(\d+)\s*#SUP:\s*(\d)\s*#CONF:\s*(\d+.\d+)/
For the rest of the issue, there is a great answer by rock321987.
I have a variable like this
var = "!123abcabc123!"
I'm trying to capture all the '123' and 'abc' in this var.
This regex (abc|123) retrieves what I want, but...
My question is: when I try the regex !(abc|123)*! it retrieves only the last iteration. What can I do to get this output?
MATCH 1
1. [1-4] `123`
MATCH 2
1. [4-7] `abc`
MATCH 3
1. [7-10] `abc`
MATCH 4
1. [10-13] `123`
https://regex101.com/r/mD4vM8/3
Thank you!!
If your language supports \G, then you are free to use this:
(?:!|\G(?!^))\K(abc|123)(?=(?:abc|123)*!)
DEMO
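For instance, in Python the built-in re module does not support \G or \K, but the third-party regex module accepts both; a hedged sketch with the pattern above:
import regex  # third-party package (pip install regex); the built-in re module lacks \G and \K

var = "!123abcabc123!"
pattern = regex.compile(r'(?:!|\G(?!^))\K(abc|123)(?=(?:abc|123)*!)')
print(pattern.findall(var))  # ['123', 'abc', 'abc', '123']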
I have my regex working to the point where I now have two groups of text, group 1 and group 2 - I'm only really interested in the group 2 text. In the end, how do I get just group 2 to match/display?
("token":")([^,]+)
Something like below should work:
>>> import re
>>> p = re.compile('("token":")([^,]+)')
>>> m = p.match('...')
>>> m.group(2)
This will get the content of the second group. (Taken from here)
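A quick check with a made-up input (the actual text was not shown in the question, so the string below is hypothetical; search() is used here so the pattern does not have to sit at the start of the string):
import re

text = '{"token":"abc123","expires":3600}'  # hypothetical sample input
p = re.compile(r'("token":")([^,]+)')
m = p.search(text)
print(m.group(2))  # abc123"  (everything after "token":" up to the next comma)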