Pandas dataframe replace string in multiple columns by finding substring - python-2.7

I have a very large pandas data frame containing both string and integer columns. I'd like to search the whole data frame for a specific substring, and if found, replace the full string with something else.
I've found some examples that do this by specifying the column(s) to search, like this:
df = pd.DataFrame([[1,'A'], [2,'(B,D,E)'], [3,'C']],columns=['Question','Answer'])
df.loc[df['Answer'].str.contains(','), 'Answer'] = 'X'
But because my data frame has dozens of string columns in no particular order, I don't want to specify them all. As far as I can tell, df.replace will not work since I'm only searching for a substring. Thanks for your help!

You can use the data frame replace method with regex=True, and use .*,.* to match strings that contain a comma (you can replace the comma with any other substring you want to detect):
str_cols = ['Answer'] # specify columns you want to replace
df[str_cols] = df[str_cols].replace('.*,.*', 'X', regex=True)
df
#   Question Answer
# 0        1      A
# 1        2      X
# 2        3      C
or if you want to replace all string columns:
str_cols = df.select_dtypes(['object']).columns
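Putting the two pieces together, a minimal self-contained sketch (using the example DataFrame from the question):

```python
import pandas as pd

df = pd.DataFrame([[1, 'A'], [2, '(B,D,E)'], [3, 'C']],
                  columns=['Question', 'Answer'])

# find all string (object-dtype) columns, then replace any value
# containing a comma with 'X' across all of them in one call
str_cols = df.select_dtypes(['object']).columns
df[str_cols] = df[str_cols].replace('.*,.*', 'X', regex=True)
```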

Related

Use column name in regex in pandas

I use:
df[df['A'].astype(str).str.contains("^XYZ|^$", regex=True)]
to select rows where the value in column A starts with a pattern ('XYZ') or is an empty string. I need to use the value of another column (e.g. column 'B') instead of XYZ. How is it possible to include the name of this column in the regex? Is it even possible?
A possible solution is to use re.search with DataFrame.apply():
import pandas as pd
import re
df = pd.DataFrame(
    {'A': ['XYZ won the match.', '', 'ZYX lost.'],
     'B': ['XYZ', 'WORD', 'BAC']})

df[df.apply(lambda row: bool(re.search(fr"^{re.escape(row['B'])}|^$", row['A'])), axis=1)]

## If the values in column B are valid regexes:
# df[df.apply(lambda row: bool(re.search(fr"^{row['B']}|^$", row['A'])), axis=1)]
Output:
                    A     B
0  XYZ won the match.   XYZ
1                      WORD
Note that the fr"^{re.escape(row['B'])}|^$" part builds the pattern dynamically from the row['B'] values, and all special characters in the string are escaped with re.escape to avoid regex matching issues. You do not need re.escape if the values in column B are valid regular expressions.

How to remove excess newline characters from strings within a Pandas DataFrame

My intention is to remove unnecessary newline characters from strings inside of a DataFrame.
Example:
import pandas as pd
data = ['I like this product\n\nThe product is good']
dataf = pd.DataFrame(data)
Original data:
I like this product
The product is good
I tried the following, which was not successful, since all of the newline characters were removed, whereas I wanted to keep one of them.
dataf['new'] = dataf.replace('\\n','', regex=True)
The result was this, all newline characters were removed:
I like this productThe product is good
The result I am trying to achieve is this:
I like this product
The product is good
 
This should work:
dataf['new'] = dataf.replace(r'(\n)+', r'\n', regex=True)
The + indicates one or more occurrences of the preceding pattern, and however many there are, they will all be replaced by just one newline character.
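A self-contained version of the example (the column name 'review' is added here for clarity; the question's original DataFrame used the default column 0):

```python
import pandas as pd

data = ['I like this product\n\nThe product is good']
dataf = pd.DataFrame(data, columns=['review'])

# (\n)+ matches a run of one or more newlines;
# each run collapses to a single newline
dataf['new'] = dataf['review'].replace(r'(\n)+', '\n', regex=True)
```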

Removing strings that match multiple regex patterns from pandas series

I have a Pandas dataframe column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if found, and then loops through the dataframe, splitting the column at the found match. I then drop the unneeded matching column 're_match'.
While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and creating a new column. My question is, is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are always at the end of the text block, hence, the use of split(...)[0]. However, it would be great if the unwanted strings could be extracted from any point in the text.
Also, note that combining the regexes into one long single pattern would not be preferable, as there are tens of patterns, which change on a regular basis.
df = pd.read_csv('data.csv', index_col=0)

patterns = [
    r'( regex1 \d+)',
    r'((?: regex 2)? \d{1,2} )',
    r'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:
    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')
    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)
Thank you for your help
There is indeed, and it is called df.applymap(some_function).
Consider the following example:
from pandas import DataFrame
import pandas as pd, re

df = DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

def cleanitup(val):
    """ Multiplies digit values """
    rx = re.compile(r'^\d+$')
    if rx.match(val):
        return int(val) * 10
    else:
        return val

# here is where the magic starts
df.applymap(cleanitup)
Obviously, I made it up, but now in every cell with only digits before, these have been multiplied by 10, every other value has been left untouched.
With this in mind, you can check and rearrange your values if necessary in the function cleanitup().
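For completeness, a more vectorised sketch (not from the answer above, and with an illustrative pattern and sample text): pandas string methods accept regex directly, so each pattern can be stripped with Series.str.replace, avoiding iterrows() entirely. Unlike the split(...)[0] approach, this also removes matches from any point in the text, which the asker said would be welcome:

```python
import pandas as pd

df = pd.DataFrame({'text': ['keep this regex1 42', 'nothing to remove']})
patterns = [r'( regex1 \d+)']  # illustrative; substitute your own list

# remove every match of every pattern, vectorised over the column
for p in patterns:
    df['text'] = df['text'].str.replace(p, '', regex=True, case=False)
```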

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. My code returns a list of numbers, however, instead of the dataframe columns (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone like I was trying to get.
To return the column names you need to pass value=TRUE as an additional argument to grep. The default in grep is value=FALSE, which returns the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
Here is a solution in dplyr:
library(dplyr)
your_df %>%
  select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
My answer is based on this SO post. As per the regex, you were very close.
Just note that [...] creates a character class matching a single character from the defined set, and that is the main reason your pattern was not working. Also, perl=TRUE is generally safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)
See IDEONE demo

R: replacing special character in multiple columns of a data frame

I am trying to replace the German special character "ö" in a data frame with "oe". The character occurs in multiple columns, so I would like to do this in one go without having to specify individual columns.
Here is a small example of the data frame
data <- data.frame(a=c("aö","ab","ac"),b=c("bö","bb","ab"),c=c("öc","öb","acö"))
I tried :
data[data=="ö"]<-"oe"
but this did not work since I would need to work with regular expressions here. However when I try :
data[grepl("ö",data)]<-"oe"
I do not get what I want.
The dataframe at the end should look like:
> data
a b c
1 aoe boe oec
2 ab bb oeb
3 ac ab acoe
>
The file is a csv that I import with read.csv. However, there seems to be no option in the import statement to fix this.
How do I get the desired outcome?
Here's one way to do it:
data <- apply(data,2,function(x) gsub("ö",'oe',x))
Explanation:
Your grepl doesn't work because grepl just returns a boolean matrix (TRUE/FALSE) corresponding to the elements in your data frame for which the regex matches. What the assignment then does is replace not just the character you want replaced but the entire string. To replace part of a string, you need sub (if you want to replace just once in each string) or gsub (if you want all occurrences replaces). To apply that to every column you loop over the columns using apply.
If you want to return a data frame, you can use:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))