How to remove excess newline characters from strings within a Pandas DataFrame - regex

My intention is to remove unnecessary newline characters from strings inside of a DataFrame.
Example:
import pandas as pd
data = ['I like this product\n\nThe product is good']
dataf = pd.DataFrame(data)
Original data:
I like this product
The product is good
I tried the following, which was not successful, since all of the newline characters were removed, whereas I wanted to keep one of them.
dataf['new'] = dataf.replace('\\n','', regex=True)
The result was this, all newline characters were removed:
I like this productThe product is good
The result I am trying to achieve is this:
I like this product
The product is good
 

This should work:
dataf['new'] = dataf.replace(r'(\n)+', r'\n', regex=True)
The + quantifier matches one or more occurrences of the preceding pattern, so however many consecutive newlines there are, the whole run is replaced by a single newline character.
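Here is a minimal, self-contained sketch of the same idea using .str.replace on the column (the column name review is just for illustration; the question's frame used the default 0 label):

import pandas as pd

dataf = pd.DataFrame({'review': ['I like this product\n\nThe product is good']})

# \n+ matches a run of one or more newlines; each run collapses to a single \n.
dataf['new'] = dataf['review'].str.replace(r'\n+', '\n', regex=True)

print(dataf.loc[0, 'new'])
# I like this product
# The product is good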

Related

Pandas dataframe replace string in multiple columns by finding substring

I have a very large pandas data frame containing both string and integer columns. I'd like to search the whole data frame for a specific substring, and if found, replace the full string with something else.
I've found some examples that do this by specifying the column(s) to search, like this:
df = pd.DataFrame([[1,'A'], [2,'(B,D,E)'], [3,'C']],columns=['Question','Answer'])
df.loc[df['Answer'].str.contains(','), 'Answer'] = 'X'
But because my data frame has dozens of string columns in no particular order, I don't want to specify them all. As far as I can tell using df.replace will not work since I'm only searching for a substring. Thanks for your help!
You can use the data frame replace method with regex=True, and use .*,.* to match strings that contain a comma (you can replace the comma with any other substring you want to detect):
str_cols = ['Answer'] # specify columns you want to replace
df[str_cols] = df[str_cols].replace('.*,.*', 'X', regex=True)
df
#    Question Answer
# 0         1      A
# 1         2      X
# 2         3      C
or if you want to replace all string columns:
str_cols = df.select_dtypes(['object']).columns
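Putting the two pieces together, a sketch using the question's example frame (select_dtypes(['object']) picks up every string column, however many there are):

import pandas as pd

df = pd.DataFrame([[1, 'A'], [2, '(B,D,E)'], [3, 'C']],
                  columns=['Question', 'Answer'])

# Treat every object-dtype column as a string column.
str_cols = df.select_dtypes(['object']).columns
df[str_cols] = df[str_cols].replace('.*,.*', 'X', regex=True)

print(df)
#    Question Answer
# 0         1      A
# 1         2      X
# 2         3      C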

Removing strings that match multiple regex patterns from pandas series

I have a Pandas dataframe column containing text that needs to be cleaned of strings that match various regex patterns. My current attempt (given below) loops through each pattern, creating a new column containing the match if found, and then loops through the dataframe, splitting the column at the found match. I then drop the unneeded matching column 're_match'.
While this works for my current use case, I can't help but think that there must be a much more efficient, vectorised way of doing this in pandas, without needing to use iterrows() and creating a new column. My question is, is there a more optimal way of removing strings that match multiple regex patterns from a column?
In my current use case the unwanted strings are always at the end of the text block, hence, the use of split(...)[0]. However, it would be great if the unwanted strings could be extracted from any point in the text.
Also, note that combining the regexes into one long single pattern would be undesirable, as there are tens of patterns, which will change on a regular basis.
import re
import pandas as pd

df = pd.read_csv('data.csv', index_col=0)

patterns = [
    r'( regex1 \d+)',
    r'((?: regex 2)? \d{1,2} )',
    r'( \d{0,2}.?\d{0,2}-?\d{1,2}.?\d{0,2}regex3 )',
]

for p in patterns:
    df['re_match'] = df['text'].str.extract(
        pat=p, flags=re.IGNORECASE, expand=False
    )
    df['re_match'] = df['re_match'].fillna('xxxxxxxxxxxxxxx')

    for index, row in df.iterrows():
        df.loc[index, 'text'] = row['text'].split(row['re_match'])[0]

df = df.drop('re_match', axis=1)
Thank you for your help
There is indeed, and it is called df.applymap(some_function).
Consider the following example:
import re
import pandas as pd
from pandas import DataFrame

df = DataFrame({'key1': ['1000', '2000'], 'key2': ['3000', 'digits(1234)']})

def cleanitup(val):
    """ Multiplies digit values """
    rx = re.compile(r'^\d+$')
    if rx.match(val):
        return int(val) * 10
    else:
        return val

# here is where the magic starts
df.applymap(cleanitup)
The example is contrived, of course, but now every cell that previously contained only digits has been multiplied by 10, and every other value has been left untouched.
With this in mind, you can check and rearrange your values as needed inside the function cleanitup().
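Adapting that idea to the question (a hypothetical sketch; clean_text and the sample pattern list are illustrative, not the asker's real patterns), the mapped function can truncate the text at the first match of any pattern, which removes the need for the re_match column and iterrows():

import re
import pandas as pd

patterns = [r'( regex1 \d+)']  # the real pattern list goes here

# Compile once up front rather than on every cell.
compiled = [re.compile(p, flags=re.IGNORECASE) for p in patterns]

def clean_text(val):
    """Cut the text off at the first match of any pattern."""
    for rx in compiled:
        m = rx.search(val)
        if m:
            val = val[:m.start()]
    return val

df = pd.DataFrame({'text': ['keep this regex1 99 drop this']})
df['text'] = df['text'].map(clean_text)  # map() since only one column needs cleaning
print(df.loc[0, 'text'])  # keep this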

R: replacing special character in multiple columns of a data frame

I am trying to replace the German special character "ö" in a data frame with "oe". The character occurs in multiple columns, so I would like to be able to do this all at once, without having to specify individual columns.
Here is a small example of the data frame
data <- data.frame(a=c("aö","ab","ac"),b=c("bö","bb","ab"),c=c("öc","öb","acö"))
I tried :
data[data=="ö"]<-"oe"
but this did not work since I would need to work with regular expressions here. However when I try :
data[grepl("ö",data)]<-"oe"
I do not get what I want.
The dataframe at the end should look like:
> data
a b c
1 aoe boe oec
2 ab bb oeb
3 ac ab acoe
>
The data is a CSV file that I import with read.csv. However, there seems to be no option in the import call to fix this.
How do I get the desired outcome?
Here's one way to do it:
data <- apply(data,2,function(x) gsub("ö",'oe',x))
Explanation:
Your grepl doesn't work because grepl just returns a boolean matrix (TRUE/FALSE) indicating which elements of the data frame the regex matches. The assignment then replaces not just the character you want replaced but each entire matching string. To replace part of a string, you need sub (if you want to replace just the first occurrence in each string) or gsub (if you want all occurrences replaced). To apply that to every column, you loop over the columns using apply.
If you want to return a data frame, you can use:
data.frame(lapply(data, gsub, pattern = "ö", replacement = "oe"))

Remove full stops, commas and quotation marks from a list in Python

I have Python code that counts word frequencies in a text file. The problem with the program is that it takes full stops into account, which alters the count. For counting words I've used a sorted list of words. I tried to remove the full stops using
words = open(f, 'r').read().lower().split()
uniqueword = sorted(set(words))
uniqueword = uniqueword.replace(".","")
but I get an error:
AttributeError: 'list' object has no attribute 'replace'
Any help would be appreciated :)
You can process the words before you make the set, using a list comprehension:
words = [word.replace(".", "") for word in words]
You could also remove them after (uniquewords = [word.replace...]), but then you will reintroduce duplicates.
Note that if you want to count these words, a Counter may be more useful:
from collections import Counter
counts = Counter(words)
You might be better off with
words = re.findall(r'\w+', open(f, 'r').read().lower())
which will grab all the strings composed of one or more “word characters” and will ignore punctuation and whitespace.
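For instance, a self-contained sketch combining that with Counter (sample.txt is a placeholder filename):

import re
from collections import Counter

with open('sample.txt', 'r') as fh:
    # \w+ grabs runs of word characters, skipping punctuation and whitespace.
    words = re.findall(r'\w+', fh.read().lower())

counts = Counter(words)
print(counts.most_common(5))  # the five most frequent words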

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curncy
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curncy.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letters listed below in expr and can have two-digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do this?
Assuming there are no leading or trailing whitespaces and only uppercase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab, nor whether it covers Perl Compatible Regex. If it fails, try e.g. with a literal space instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
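Not Matlab, but as a quick sanity check, the same pattern behaves as described when run through Python's re module:

import re

pattern = r'^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$'

for ticker in ['MCDZ3 Curncy', 'S H4 Comdty']:
    m = re.match(pattern, ticker)
    if m:
        root, monthyear, key = m.groups()
        print(root.strip(), monthyear, key)
# MCD Z3 Curncy
# S H4 Comdty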
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space.¹
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more generic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
¹ This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as the space will reappear in the rootyk output instead.
Assuming you just want to get rid of leading and/or trailing spaces at the edges, there is a very simple function for that:
monthyear = strtrim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach: basically, this searches for the letter just before your year number:
s = 'MCDZ3 Curncy'
p = regexp(s,'\d')            % positions of all digits in the string
s(min(p)-1:max(p))            % letter before the first digit through the last digit: 'Z3'