Separating columns based on Regex | Pandas - regex

I have converted a PDF to a DataFrame and am almost at the format I want, but I am stuck on the following step. I have a column which looks like this -
Column A
1234[321]
321[3]
123
456[456]
and want to separate it into two different columns B and C such that -
Column B Column C
1234 321
321 3
123 0
456 456
How can this be achieved? I did try something along the lines of
df["Column A"].str.strip(r"\[\d+\]")
but I have not been able to get through after trying different variations. Any help will be greatly appreciated as this is the final part of this task. Much thanks in advance!

An alternative could be:
import pandas as pd
# Create the two new columns by splitting on the opening bracket
df[["Column B", "Column C"]] = df["Column A"].str.split("[", expand=True)
# Get rid of the extra bracket (regex=False treats "]" as a literal character)
df["Column C"] = df["Column C"].str.replace("]", "", regex=False)
# Replace the NaN values with 0 and drop the now useless column
df = df.fillna(0).drop("Column A", axis=1)
# Convert all columns to numeric
df = df.apply(pd.to_numeric)

You may use
import pandas as pd
df = pd.DataFrame({'Column A': ['1234[321]', '321[3]', '123', '456[456]']})
df[['Column B', 'Column C']] = df['Column A'].str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# If you need to drop Column A here, use
# df[['Column B', 'Column C']] = df.pop('Column A').str.extract(r'^(\d+)(?:\[(\d+)])?$', expand=False)
# Fill the rows that have no bracketed part; avoids chained assignment
df['Column C'] = df['Column C'].fillna(0)
df
# Column A Column B Column C
# 0 1234[321] 1234 321
# 1 321[3] 321 3
# 2 123 123 0
# 3 456[456] 456 456
The regex matches
^ - start of string
(\d+) - Group 1: one or more digits
(?:\[(\d+)])? - an optional non-capturing group matching [, then capturing into Group 2 one or more digits, and then a ]
$ - end of string.
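The optional group can be checked with the standard re module. This is only a minimal sketch: the helper name split_value is hypothetical and not part of the answer above. When the bracketed part is missing, Group 2 does not participate in the match and comes back as None, which is what shows up as NaN after str.extract.

```python
import re

# Same pattern as in the answer above
PATTERN = re.compile(r'^(\d+)(?:\[(\d+)])?$')


def split_value(value):
    """Hypothetical helper: return (B, C) as integers, with C = 0 when there are no brackets."""
    match = PATTERN.match(value)
    if match is None:
        raise ValueError(f"unexpected format: {value!r}")
    # second is None when the optional group did not participate in the match
    first, second = match.groups()
    return int(first), (int(second) if second is not None else 0)


results = [split_value(v) for v in ['1234[321]', '321[3]', '123', '456[456]']]
print(results)
```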

Related

Different output for pd.str.extract() and re.search()

As seen in my previous question
Rename columns regex, keep name if no match
Why is there a different output of the regex?
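import re
import pandas as pd
# Raw strings keep the backslashes literal without invalid-escape warnings
data = {'First_Column': [1,2,3], 'Second_Column': [1,2,3],
        r'\First\Mid\LAST.Ending': [1,2,3], r'First1\Mid1\LAST1.Ending': [1,2,3]}
df = pd.DataFrame(data)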
First_Column Second_Column \First\Mid\LAST.Ending First1\Mid1\LAST1.Ending
pd.str.extract()
df.columns.str.extract(r'([^\\]+)\.Ending')
0
0 NaN
1 NaN
2 LAST
3 LAST1
re.search()
col = df.columns.tolist()
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group())
LAST.Ending
LAST1.Ending
THX
From pandas.Series.str.extract docs
Extract capture groups in the regex pat as columns in a DataFrame.
It returns the capture group. Whereas, re.search with group() or group(0) returns the whole match, but if you change to group(1) it will return the capture group 1.
This will return full match:
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group())
LAST.Ending
LAST1.Ending
This will return only the capture group:
for i in col[2:]:
    print(re.search(r'([^\\]+)\.Ending', i).group(1))
LAST
LAST1

Pandas regex split on characters and group

I've never got around to learning regex till now, but I'm trying to figure out how to use it in pandas with Series.str.match(expression) in order to split one column into two new columns. (I know I can do this without regex.)
examples of the column data are:
True Grit {'Rooster Cogburn'}
The King's Speech {'King George VI'}
Biutiful {'Uxbal'}
Where there can be any number of strings greater than 1 in each of the two groupings. How can I extract two groups to result in True Grit, Rooster Cogburn?
Given this dataframe
col
0 True Grit {Rooster Cogburn}
1 The King's Speech {King George VI}
2 Biutiful {Uxbal}
df = df.col.str.extract(r'(.*?)\s*\{(.*)\}', expand=True)
will return
0 1
0 True Grit Rooster Cogburn
1 The King's Speech King George VI
2 Biutiful Uxbal
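One detail worth checking is the whitespace before the opening brace: a greedy (.*) keeps the trailing space in the first group, while a lazy (.*?) leaves it to \s*. A small sketch with the standard re module, using one of the sample titles (the variable names are only for illustration):

```python
import re

title = "True Grit {Rooster Cogburn}"

# Greedy first group: it backtracks only as far as needed, so it keeps the trailing space
greedy = re.search(r'(.*)\s*\{(.*)\}', title).groups()

# Lazy first group: it stops as early as possible, so \s* consumes the space
lazy = re.search(r'(.*?)\s*\{(.*)\}', title).groups()

print(greedy)
print(lazy)
```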

[Python3]RegEx to match multiple strings

I am trying to match multiple strings, which also include an optional capture group.
My RegEx:
(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])
Strings:
[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]
[TEst]_ANother - Company Of 2[Hi-Res]
[Yes]coOl__ - some text_[20.06.96][Hi-Res]
How can I match all of these and optimize my RegEx? I'm still new to this.
I assume this is what you want:
r"\[(.*?)\](.*?)\s*-\s*(.*?)(?:\[(\d{2}\.\d{2}\.\d{2})\])?\[(.*?)\]"
Consider approaching this with pandas as shown below:
import pandas as pd
# create a Series object containing the strings to be searched
s = pd.Series([
'[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]',
'[TEst]_ANother - Company Of 2[Hi-Res]',
'[Yes]coOl__ - some text_[20.06.96][Hi-Res]'
])
# use pandas' StringMethods to perform regex extraction; a DataFrame object is returned because your regex contains more than one capture group
s.str.extract(r'(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])', expand=True)
# returns the following
0 1 2 3 4
0 [Test] Kyubiikitsune Company Of Wolves[20.06.96] NaN [Hi-Res]
1 [TEst] _ANother Company Of 2 NaN [Hi-Res]
2 [Yes] coOl__ some text_[20.06.96] NaN [Hi-Res]
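Note that the date column stays NaN with the original pattern, because the greedy (.*) in front of the optional group swallows the date. The sketch below compares it with the lazy pattern suggested earlier in this thread, using the standard re module; the variable names are only illustrative. The same pattern string should also be usable with Series.str.extract.

```python
import re

line = "[Test]Kyubiikitsune - Company Of Wolves[20.06.96][Hi-Res]"

# Pattern from the question: the greedy third group absorbs the date
original = r'(\[[A-Za-z]*\])(.*) - (.*)(.[0-9]{2}\.[0-9]{2}\.[0-9]{2}.)?(\[.*\])'

# Lazy pattern from the answer: the optional date group gets a chance to match
lazy = r'\[(.*?)\](.*?)\s*-\s*(.*?)(?:\[(\d{2}\.\d{2}\.\d{2})\])?\[(.*?)\]'

original_groups = re.search(original, line).groups()
lazy_groups = re.search(lazy, line).groups()

# A string without a date: the optional group is simply None
no_date = re.search(lazy, "[TEst]_ANother - Company Of 2[Hi-Res]").groups()

print(original_groups)
print(lazy_groups)
print(no_date)
```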

sql lite regexp - why returns values in digits instead of actual value

I am having this issue. In col1 there are the values abc, def, ghi.
When I use regexp
select col1, col1 regexp '[a-z]' as result from table_name
I get back 0 or 1. I understand that 0 means there are no [a-z] chars in the string and 1 means there are, but I am looking for a solution where regexp would show me how many matching chars are within a particular cell, not as the count 3 for abc, but as 111.
so for an example:
abc = 111
is this possible?
Thanks.
I have found a solution myself:
select col1,regexp_replace(col1, '[a-z]', '1') from my_table;
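As far as I am aware, regexp_replace is not part of core SQLite, so the query above presumably relies on an extension or wrapper that supplies it. A minimal sketch of one way to get the same result, assuming Python's sqlite3 module and a user-defined function registered under that name (the function body and the sample rows are my own illustration, not from the post):

```python
import re
import sqlite3


def regexp_replace(value, pattern, replacement):
    """Hypothetical SQL helper: expose re.sub as regexp_replace(value, pattern, replacement)."""
    if value is None:
        return None
    return re.sub(pattern, replacement, value)


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL with 3 arguments
conn.create_function("regexp_replace", 3, regexp_replace)

conn.execute("CREATE TABLE my_table (col1 TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?)", [("abc",), ("def",), ("a-c",)])

rows = conn.execute(
    "SELECT col1, regexp_replace(col1, '[a-z]', '1') FROM my_table ORDER BY rowid"
).fetchall()
print(rows)
conn.close()
```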

How to extract rows where a string matches?

My data is as below, and I want to extract only those rows where the data column contains a string like "7_", whose position is not fixed. I want to keep the rows where "7_" matches, and the data column should then hold only the "7_" values.
Row No Name data
1 ABC 4_6035;9_47;7_113838;0_14
2 xyz 0_6035;7_145
3 MNO 4_6035;5_47;8_113838;7_14
4 PPP 0_6035;5_145
Output I am looking for is
Row No Name data
1 ABC 7_113838
2 xyz 7_145
3 MNO 7_14
Please help.
^(?=.*\\b7_).*$
You can try this. See demo.
https://regex101.com/r/oL9kE8/10
Try this
within(df[grep("7_", df$data, fixed = TRUE), ],
data <- sub(".*?(7_[^;]*).*", "\\1", data))
# RowNo Name data
# 1 1 ABC 7_113838
# 2 2 xyz 7_145
# 3 3 MNO 7_14
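For comparison, roughly the same idea in Python with the standard re module. This is a sketch rather than a translation of the R answer: the rows are typed in by hand from the question, and anchoring the item to the start of the string or a preceding ";" is my own assumption, added so that an item such as 17_5 would not count as a match.

```python
import re

rows = [
    (1, "ABC", "4_6035;9_47;7_113838;0_14"),
    (2, "xyz", "0_6035;7_145"),
    (3, "MNO", "4_6035;5_47;8_113838;7_14"),
    (4, "PPP", "0_6035;5_145"),
]

# An item starting with "7_" at the beginning of the string or right after a ";"
ITEM = re.compile(r'(?:^|;)(7_[^;]*)')

filtered = []
for row_no, name, data in rows:
    match = ITEM.search(data)
    if match:  # rows without a "7_" item are dropped
        filtered.append((row_no, name, match.group(1)))

for row in filtered:
    print(*row)
```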