How to drop a pattern inside a dataframe column - regex

I have a date column in a dataframe, and the dates look like "2020-03-02" or "2020-02".
I want to remove the "-03-" from dates like the first one.
I tried:
df.Collection_Date.replace(to_replace='-\d\d-', value='-', regex=True)
but it just returns something like a new dataframe filled with "0".
How can I remove it?

You may convert the column to string type, and then run the regex replacement:
df['Collection_Date'] = df['Collection_Date'].astype('str').str.replace(r'-\d\d-', '-', regex=True)
# .astype('str') converts the column to string dtype so the .str accessor works
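For instance, a minimal runnable sketch (the sample values are made up to mirror the question):
import pandas as pd

df = pd.DataFrame({'Collection_Date': ['2020-03-02', '2020-02', '2020-12-15']})

# Convert to string dtype first, then collapse the middle "-MM-" down to a single dash
df['Collection_Date'] = (
    df['Collection_Date']
    .astype('str')
    .str.replace(r'-\d\d-', '-', regex=True)
)

print(df)
#   Collection_Date
# 0         2020-02
# 1         2020-02
# 2         2020-15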


Use column name in regex in pandas

I use:
df[df['A'].astype(str).str.contains("^XYZ|^$", regex=True)]
to select rows where the value in column A starts with a pattern ('XYZ') or is an empty string. I need to use the value of another column (e.g. column 'B') instead of 'XYZ'. Is it possible to reference this column inside the regex?
A possible solution is to use re.search with DataFrame.apply():
import pandas as pd
import re

df = pd.DataFrame({
    'A': ['XYZ won the match.', '', 'ZYX lost.'],
    'B': ['XYZ', 'WORD', 'BAC']
})

df[df.apply(lambda row: bool(re.search(fr"^{re.escape(row['B'])}|^$", row['A'])), axis=1)]
# If the values in column B are valid regexps:
# df[df.apply(lambda row: bool(re.search(fr"^{row['B']}|^$", row['A'])), axis=1)]
Output:
                    A     B
0  XYZ won the match.   XYZ
1                       WORD
Note that the fr"^{re.escape(row['B'])}|^$" part builds the pattern dynamically from the row['B'] values, and all special characters in the string are escaped with re.escape to avoid regex matching issues. You do not need re.escape if the values in column B are valid regular expressions.
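As a side note, the same row-wise test can be written as a plain list comprehension over the two columns, which is often a bit faster than DataFrame.apply; a minimal sketch reusing the frame above:
mask = [bool(re.search(fr"^{re.escape(b)}|^$", a)) for a, b in zip(df['A'], df['B'])]
df[mask]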

loop regex over a pandas series

I have a list of words and I want to check whether they occur in the 'text' column of a dataframe. I wrote the regex, and I want to loop it over the column, extract the matching words, and then remove duplicates to obtain the unique matching words.
regex_list = []
for lex in deplex_fin:
    regex_list.append('/(^|\W)' + lex + '($|\W)/i')

matching_words_list = []
for regex in regex_list:
    matching_words_df = neg_sent['cleanText_emrm'].str.extract(regex)
    matching_words = list(matching_words_df.iloc[:, 0])
    for item in matching_words:
        if str(item) != 'nan':
            matching_words_list.append(item)
But this is taking too long -- is there any faster way to do this?
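Note that the /.../i delimiters are a JavaScript/Perl convention; Python's re module expects the bare pattern plus re.IGNORECASE. One common speed-up is to combine all the words into a single alternation so pandas scans each row only once. A minimal sketch, with a hypothetical word list (re.escape guards against special characters in the words, and \b word boundaries stand in for the (^|\W)...($|\W) anchors):
import re
import pandas as pd

deplex_fin = ['bad', 'awful', 'poor']  # hypothetical word list
neg_sent = pd.DataFrame({'cleanText_emrm': ['a bad day', 'an awful, poor show', 'all good']})

# One pass over the column: a single alternation bounded by \b word boundaries
pattern = r'\b(' + '|'.join(re.escape(lex) for lex in deplex_fin) + r')\b'
matches = neg_sent['cleanText_emrm'].str.findall(pattern, flags=re.IGNORECASE)

# Flatten the per-row match lists and deduplicate
matching_words_list = sorted({word for row in matches for word in row})
print(matching_words_list)  # ['awful', 'bad', 'poor']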

Pandas dataframe replace string in multiple columns by finding substring

I have a very large pandas data frame containing both string and integer columns. I'd like to search the whole data frame for a specific substring, and if found, replace the full string with something else.
I've found some examples that do this by specifying the column(s) to search, like this:
df = pd.DataFrame([[1,'A'], [2,'(B,D,E)'], [3,'C']],columns=['Question','Answer'])
df.loc[df['Answer'].str.contains(','), 'Answer'] = 'X'
But because my data frame has dozens of string columns in no particular order, I don't want to specify them all. As far as I can tell using df.replace will not work since I'm only searching for a substring. Thanks for your help!
You can use the data frame replace method with regex=True, and use .*,.* to match strings that contain a comma (you can replace the comma with any other substring you want to detect):
str_cols = ['Answer']  # specify the columns you want to replace
df[str_cols] = df[str_cols].replace('.*,.*', 'X', regex=True)
df
#    Question Answer
# 0         1      A
# 1         2      X
# 2         3      C
or if you want to replace all string columns:
str_cols = df.select_dtypes(['object']).columns
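Putting the two lines together, a minimal runnable sketch reusing the example frame from above:
import pandas as pd

df = pd.DataFrame([[1, 'A'], [2, '(B,D,E)'], [3, 'C']], columns=['Question', 'Answer'])

# Replace in every object (string) column at once
str_cols = df.select_dtypes(['object']).columns
df[str_cols] = df[str_cols].replace('.*,.*', 'X', regex=True)
print(df)
#    Question Answer
# 0         1      A
# 1         2      X
# 2         3      C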

Big Query Regex for Date ETL

I have data with date info imported into BigQuery in formats like 2/13/2016 and 3/4/2012.
I want to convert it into a date format like 02-13-2016 and 03-04-2012.
I want to use a query to create a new column, using regex for the conversion.
I know the regex to match the first part (2) of 2/4/2012 will be something like
^(\d{1})(/|-)
The regex to match the 2nd part, with the /, would be
(/)(\d{1})(/)
I am wondering how to use these 2 regex along with REGEXP_EXTRACT and REGEXP_REPLACE to create a new column with these dates in correct format.
It might be easiest just to convert to a DATE type column. For example:
#standardSQL
SELECT
  PARSE_DATE('%m/%d/%Y', date_string) AS date
FROM (
  SELECT '2/13/2016' AS date_string UNION ALL
  SELECT '3/4/2012' AS date_string
);
Another option, if you want to keep the dates as strings, is to use REPLACE:
#standardSQL
SELECT
  REPLACE(date_string, '/', '-') AS date
FROM (
  SELECT '2/13/2016' AS date_string UNION ALL
  SELECT '3/4/2012' AS date_string
);

Removing strings that match multiple regex patterns from pandas series

I have a Pandas dataframe column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if found, and then loops through the dataframe, splitting the column at the found match. I then drop the unneeded matching column 're_match'.
While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and creating a new column. My question is, is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are always at the end of the text block, hence, the use of split(...)[0]. However, it would be great if the unwanted strings could be extracted from any point in the text.
Also, note that combining the regexes into one long single pattern is not preferable, as there are tens of patterns, and they change on a regular basis.
import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

patterns = [
    '( regex1 \d+)',
    '((?: regex 2)? \d{1,2} )',
    '( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:
    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')

    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)
Thank you for your help
There is indeed, and it is called df.applymap(some_function).
Consider the following example:
import re
from pandas import DataFrame

df = DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

def cleanitup(val):
    """Multiply digit-only values by 10."""
    rx = re.compile(r'^\d+$')
    if rx.match(val):
        return int(val) * 10
    else:
        return val

# here is where the magic starts
df.applymap(cleanitup)
Obviously this example is made up, but now every cell that previously contained only digits has been multiplied by 10, and every other value has been left untouched.
With this in mind, you can check and rearrange your values as necessary inside cleanitup().
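To tie this back to the question, here is a minimal sketch (with a made-up frame and one of the patterns from the question) that uses Series.apply with re.sub to strip every match:
import re
import pandas as pd

df = pd.DataFrame({'text': ['keep this regex1 42 tail', 'nothing to remove']})  # made-up data

compiled = [re.compile(p, flags=re.IGNORECASE) for p in [r'( regex1 \d+)']]

def remove_matches(text):
    """Strip every pattern match, wherever it occurs in the text."""
    for rx in compiled:
        text = rx.sub('', text)
    return text

df['text'] = df['text'].apply(remove_matches)
print(df)
#                 text
# 0     keep this tail
# 1  nothing to remove
Unlike the split-based approach in the question, re.sub removes matches from any point in the text, not just the tail end. Also worth noting: in recent pandas versions, DataFrame.applymap has been renamed to DataFrame.map.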