Remove all words containing '#' from list in DataFrame - regex

I have a DataFrame in which one column contains lists of words.
>>> dataset.head()
   contain
0  ["name", "Place", "ect#gtr", "nick"]
1  ["gf#e", "nobel", "play", "hi"]
I want to remove all the words which contain '#'. In the above example, I want to remove "ect#gtr" and "gf#e".

Try this one:
import numpy as np

# builds a boolean mask per column and keeps the rows where at least one
# column does not contain '#'; note this filters whole rows of string
# columns rather than removing items from lists
ab = np.column_stack([~df[col].str.contains(r"#") for col in df])
new_df = df.loc[ab.any(axis=1)]
print(new_df)

Use a list comprehension with filtering; a regex is not necessary here:
import pandas as pd

df = pd.DataFrame({'contain': [['name', 'Place', 'ect#gtr', 'nick'],
                               ['gf#e', 'nobel', 'play', 'hi']]})
print (df)
                        contain
0  [name, Place, ect#gtr, nick]
1       [gf#e, nobel, play, hi]
df.contain = df.contain.apply(lambda x: [y for y in x if '#' not in y])
Or:
df.contain = [[y for y in x if '#' not in y] for x in df.contain]
print (df)
               contain
0  [name, Place, nick]
1    [nobel, play, hi]
EDIT: To remove such values from within plain strings, combine split with join:
df = pd.DataFrame({'contain':['name Place ect#gtr nick',"gf#e nobel play hi"]})
print (df)
contain
0 name Place ect#gtr nick
1 gf#e nobel play hi
df.contain = df.contain.apply(lambda x: ' '.join([y for y in x.split() if '#' not in y]))
print (df)
contain
0 name Place nick
1 nobel play hi
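Since the question title mentions regex, here is a sketch of a regex-only alternative for the plain-string case (my addition, not part of the answer above): a single str.replace that drops any whitespace-delimited token containing '#'.
import pandas as pd

df = pd.DataFrame({'contain': ['name Place ect#gtr nick', 'gf#e nobel play hi']})
# \S*#\S* matches any whitespace-delimited token containing '#',
# \s? also consumes the following space; strip() tidies the edges
df.contain = df.contain.str.replace(r'\S*#\S*\s?', '', regex=True).str.strip()
print (df)
           contain
0  name Place nick
1    nobel play hi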

use a custom function to find all words in a column

Background
The following question is a variation on Unnest grab keywords/nextwords/beforewords function.
1) I have the following word_list
word_list = ['crayons', 'cars', 'camels']
2) And df1
l = ['there are many crayons, in the blue box crayons that are',
     'cars! i like a lot of sports cars because they go fast',
     'the camels, in the middle east have many camels to ride ']
df1 = pd.DataFrame(l, columns=['Text'])
df1
Text
0 there are many crayons, in the blue box crayons that are
1 cars! i like a lot of sports cars because they go fast
2 the camels, in the middle east have many camels to ride
3) I also have a function find_next_words which uses word_list to grab words from Text column in df1
def find_next_words(row, word_list):
    sentence = row[0]
    trigger_words = []
    next_words = []
    for keyword in word_list:
        words = sentence.split()
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                next_words.append(words[index + 1:index + 3])
    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])
4) And it's pieced together with the following
df2 = df1.join(df1.apply(lambda x: find_next_words(x, word_list), axis=1))
Output
Text TriggerWords NextWords
0 [crayons] [[that, are]]
1 [cars] [[because, they]]
2 [camels] [[to, ride]]
Problem
5) The output misses the following
crayons, from row 0 of Text column df1
cars! from row 1 of Text column df1
camels, from row 2 of Text column df1
Goal
6) Grab all corresponding words from df1 even when they differ slightly (e.g. crayons, or cars!) from the words in word_list.
(For this toy example, I know I could easily fix the problem by just adding these variations to word_list = ['crayons,', 'crayons', 'cars!', 'cars', 'camels,', 'camels']. But this would be impractical to do with my real word_list, which contains ~20K words.)
Desired Output
Text TriggerWords NextWords
0 [crayons, crayons] [[in, the], [that, are]]
1 [cars, cars] [[i, like], [because, they]]
2 [camels, camels] [[in, the], [to, ride]]
Questions
How do I 1) tweak my word_list (e.g. regex?) 2) or find_next_words function to achieve my desired output?
You can tweak your regex to something like this:
\b(crayons|cars|camels)\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))
Regex Demo
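As a sketch of how that pattern could be plugged into the setup above (find_next_words_re is a hypothetical name, and building the pattern from word_list with re.escape is my assumption rather than part of the answer):
import re
import pandas as pd

def find_next_words_re(row, word_list):
    # build the alternation from word_list, then capture the trigger plus the next two words
    pattern = r'\b(' + '|'.join(map(re.escape, word_list)) + r')\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))'
    matches = re.findall(pattern, row[0])          # one (trigger, next1, next2) tuple per hit
    trigger_words = [m[0] for m in matches]
    next_words = [[m[1], m[2]] for m in matches]
    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])

df2 = df1.join(df1.apply(lambda x: find_next_words_re(x, word_list), axis=1))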
Using import nltk, change
words = sentence.split()
to
words = nltk.word_tokenize(sentence)
This tokenizes 'crayons,' as 'crayons', ',' instead of a single token 'crayons,', which allows find_next_words to correctly identify all the words from word_list in the Text column.
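Roughly, the tweaked function could look like the sketch below. The isalnum() filter on the following tokens is my addition (not part of the answer above); without it the separate punctuation tokens that word_tokenize produces would end up in NextWords.
import nltk
import pandas as pd

# nltk.download('punkt')  # tokenizer models, may need to be downloaded once

def find_next_words(row, word_list):
    sentence = row[0]
    trigger_words = []
    next_words = []
    words = nltk.word_tokenize(sentence)   # 'crayons,' -> 'crayons', ','
    for keyword in word_list:
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                # keep the next two word-like tokens, skipping punctuation tokens
                following = [w for w in words[index + 1:index + 4] if w.isalnum()]
                next_words.append(following[:2])
    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])

df2 = df1.join(df1.apply(lambda x: find_next_words(x, word_list), axis=1))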

How to remove empty values from the pandas DataFrame from a column type list

I'm looking for a solution to remove empty values from a column whose values are lists (actually string representations of lists), after some strings have already been replaced beforehand.
In df.color we are just replacing *_Blue with an empty string.
Example DataFrame:
df = pd.DataFrame({'Bird': ["parrot", "Eagle", "Seagull"],
                   'color': ["['Light_Blue','Green','Dark_Blue']",
                             "['Sky_Blue','Black','White', 'Yellow','Gray']",
                             "['White','Jet_Blue','Pink', 'Tan','Brown', 'Purple']"]})
>>> df
Bird color
0 parrot ['Light_Blue','Green','Dark_Blue']
1 Eagle ['Sky_Blue','Black','White', 'Yellow','Gray']
2 Seagull ['White','Jet_Blue','Pink', 'Tan','Brown', 'Pu...
Result of above DF:
>>> df['color'].str.replace(r'\w+_Blue\b', '')
0 ['','Green','']
1 ['','Black','White', 'Yellow','Gray']
2 ['White','','Pink', 'Tan','Brown', 'Purple']
Name: color, dtype: object
In plain Python this is easily done as follows:
>>> lst = ['','Green','']
>>> [x for x in lst if x]
['Green']
I wonder if something like the following can be done:
df.color.mask(df == ' ')
You can use explode (pandas 0.25.0+), then collect the lists back:
df['color'].str.replace(r'\w+_Blue\b', '').explode().loc[lambda x : x!=''].groupby(level=0).apply(list)
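Spelled out step by step, a sketch of the same idea (assuming, as in the question, that the column holds string representations of lists, so it is parsed with ast.literal_eval first and the *_Blue entries are dropped directly rather than being replaced with ''):
import ast
import pandas as pd

s = df['color'].apply(ast.literal_eval).explode()   # one row per list element, original index repeats
s = s[~s.str.endswith('Blue')]                      # drop the *_Blue entries
result = s.groupby(level=0).agg(list)               # collapse back to one list per original row
df.assign(color=result)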
You don't have a column of lists; you have a column that contains string representations of lists. You can do this all in a single step using ast.literal_eval and str.endswith. I would use a list comprehension here, which should be faster than apply:
import ast
fixed = [
    [el for el in lst if not el.endswith("Blue")]
    for lst in df['color'].apply(ast.literal_eval)
]
df.assign(color=fixed)
Bird color
0 parrot [Green]
1 Eagle [Black, White, Yellow, Gray]
2 Seagull [White, Pink, Tan, Brown, Purple]
Another way using filter and apply:
(df['color'].str.replace(r'\w+_Blue\b', '')
            .apply(lambda x: list(filter(bool, ast.literal_eval(x)))))
0 [Green]
1 [Black, White, Yellow, Gray]
2 [White, Pink, Tan, Brown, Purple]

extra commas when using read_csv causing too many "s in data frame

I'm trying to read in a large file (~8 GB) using pandas read_csv. In one of the columns in the data there is sometimes a list which includes commas, but it is enclosed by curly brackets, e.g.
"label1","label2","label3","label4","label5"
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}
Therefore, when these particular lines were read in, I was getting the error "Error tokenizing data. C error: Expected 37 fields in line 35, saw 42". I found a solution which said to add
sep=",(?![^{]*})" to the read_csv arguments, which split the data correctly. However, the data now includes the quotation marks around every entry (this didn't happen before I added the sep argument).
The data looks something like this now:
"label1" "label2" "label3" "label4" "label5"
"{A1}" "2" "" "False" "{ "apple" : false, "pear" : false, "banana" : null}"
meaning I can't use, for example, .describe() etc. on the numerical data because the values are still strings.
Does anyone know of a way of reading it in without the quotation marks but still splitting the data where it is?
Very new to Python so apologies if there is an obvious solution.
serialdev found a solution to removing the "s, but the data columns are objects and not what I would expect/want, e.g. the integer values aren't seen as integers.
The data needs to be split at "," explicitly (including the "s); is there a way of stating that in the read_csv arguments?
Thanks!
To read in the data structure you specified, where the last element has an unknown length:
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}"
"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null, "orange": "true"}"
Change the separator to a regular expression using a negative lookahead assertion. This will enable you to separate on a ',' only when it is not immediately followed by a space:
df = pd.read_csv('my_file.csv', sep=r'[,](?!\s)', engine='python', thousands='"')
print(df)
0 1 2 3 4
0 "{A1}" 2 NaN "False" "{ "apple" : false, "pear" : false, "banana" :...
1 "{A1}" 2 NaN "False" "{ "apple" : false, "pear" : false, "banana" :...
Specifying the thousands separator as the quote is a bit of a hacky way to parse fields containing a quoted integer into the correct datatype. You can achieve the same result using converters, which can also remove the quotes from the strings should you need it to, and cast "True" or "False" to a boolean.
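For example, a converters-based sketch along those lines (the helper functions, the positional column keys, and header=None are my assumptions, not tested against your file):
import pandas as pd

def unquote(value):
    return value.strip('"')

def to_bool(value):
    return value.strip('"') == 'True'

def to_int(value):
    value = value.strip('"')
    return int(value) if value else None

df = pd.read_csv('my_file.csv', sep=r'[,](?!\s)', engine='python', header=None,
                 converters={0: unquote, 1: to_int, 3: to_bool, 4: unquote})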
If need remove " from column, use vectorized function str.strip:
import pandas as pd
mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
{'"first_name"': '"Bob"', '"age"': '"8"'},
{'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
print (df)
"age" "first_name"
0 "7" "Bill"
1 "8" "Bob"
2 "9" "Ben"
df['"first_name"'] = df['"first_name"'].str.strip('"')
print (df)
"age" "first_name"
0 "7" Bill
1 "8" Bob
2 "9" Ben
If you need to apply str.strip() to all columns, use:
df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
df.columns = df.columns.str.strip('"')
print (df)
age first_name
0 7 Bill
1 8 Bob
2 9 Ben
Timings:
mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
{'"first_name"': '"Bob"', '"age"': '"8"'},
{'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
df = pd.concat([df]*3, axis=1)
df.columns = ['"first_name1"','"age1"','"first_name2"','"age2"','"first_name3"','"age3"']
#create sample [300000 rows x 6 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
df1,df2 = df.copy(),df.copy()
def a(df):
    df.columns = df.columns.str.strip('"')
    df['age1'] = df['age1'].str.strip('"')
    df['first_name1'] = df['first_name1'].str.strip('"')
    df['age2'] = df['age2'].str.strip('"')
    df['first_name2'] = df['first_name2'].str.strip('"')
    df['age3'] = df['age3'].str.strip('"')
    df['first_name3'] = df['first_name3'].str.strip('"')
    return df

def b(df):
    #apply str function to all columns in dataframe
    df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
    df.columns = df.columns.str.strip('"')
    return df

def c(df):
    #apply str function to all columns in dataframe
    df = df.applymap(lambda x: x.lstrip('\"').rstrip('\"'))
    df.columns = df.columns.str.strip('"')
    return df
print (a(df))
print (b(df1))
print (c(df2))
In [135]: %timeit (a(df))
1 loop, best of 3: 635 ms per loop
In [136]: %timeit (b(df1))
1 loop, best of 3: 728 ms per loop
In [137]: %timeit (c(df2))
1 loop, best of 3: 1.21 s per loop
Would this work since you have all the data that you need:
.map(lambda x: x.lstrip('\"').rstrip('\"'))
So simply clean up all the occurrences of " afterwards
EDIT with example:
mydata = [{'"first_name"' : '"bill', 'age': '"75"'},
{'"first_name"' : '"bob', 'age': '"7"'},
{'"first_name"' : '"ben', 'age': '"77"'}]
IN: df = pd.DataFrame(mydata)
OUT:
"first_name" age
0 "bill "75"
1 "bob "7"
2 "ben "77"
IN: df['"first_name"'] = df['"first_name"'].map(lambda x: x.lstrip('\"').rstrip('\"'))
OUT:
0 bill
1 bob
2 ben
Name: "first_name", dtype: object
Use this sequence after selecting the column; it is not ideal, but it will get the job done:
.map(lambda x: x.lstrip('\"').rstrip('\"'))
You can change the Dtypes after using this pattern:
df['col'].apply(lambda x: pd.to_numeric(x, errors='ignore'))
or simply:
df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)
It depends on your file. Did you check whether the data contains commas inside a cell? If you have something like Banana: Fruit, Tropical, Eatable, etc. in the same cell, you're going to get this kind of error. One basic solution is to remove all commas from the file. Or, if you can read it in, you can remove the special characters afterwards:
>>>df
Banana
0 Hello, Salut, Salom
1 Bonjour
>>>df['Banana'] = df['Banana'].str.replace(',','')
>>>df
Banana
0 Hello Salut Salom
1 Bonjour

Grabbing columns with special characters and upper case letters

I have a data frame and I'm trying to loop through the data frame to identify those columns which contain a special character or which are all capital letters.
I have tried a few things, but nothing where I'm able to catch the column names within the loop.
data = data.frame(one=c(1,3,5,1,3,5,1,3,5,1,3,5), two=c(1,3,5,1,3,5,1,3,5,1,3,5),
                  thr=c("A","B","D","E","F","G","H","I","J","H","I","J"),
                  fou=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                  fiv=c(1,3,5,1,3,5,1,3,5,1,3,5),
                  six=c("A","B","D","E","F","G","H","I","J","H","I","J"),
                  sev=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                  eig=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                  nin=c(1.24,3.52,5.33,1.44,3.11,5.33,1.55,3.66,5.33,1.32,3.54,5.77),
                  ten=c(1:12),
                  ele=rep(1,12),
                  twe=c(1,2,1,2,1,2,1,2,1,2,1,2),
                  thir=c("THiS","THAT34","T(&*(", "!!!","#$#","$Q%J","who","THIS","this","this","this","this"),
                  stringsAsFactors = FALSE)
data
colls <- c()
spec=c("$","%","&")
for( col in names(data) ) {
  if( length(strings[stringr::str_detect(data[,col], spec)]) >= 1 ){
    print("HORRAY")
    colls <- c(collls, col)
  } else print ("NOOOOOOOOOO")
}

for( col in names(data) ) {
  if( any(data[,col]) %in% spec ){
    print("HORRAY")
    colls <- c(collls, col)
  } else print ("NOOOOOOOOOO")
}
Can anyone shed light on a good way to tackle this problem?
EDIT:
The end goal is to have a vector of the column names which meet that criteria. Sorry for my poor SO question, but hopefully this will help clarify what I'm trying to do.
I would use grep() to search for the pattern you are interested in. See here.
[:upper:] Matches any upper case letters.
Combining it with anchors (^,$) and match one or more times (+) gives ^[[:upper:]]+$ and should only match entries completely in capitals.
The following would match the special characters in your toy data set (but is not guaranteed to match all special characters in your real data set i.e form feeds, carriage returns)
[:punct:] #Matches punctuation - ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Note that rather than use [:punct:] you could define your special characters manually.
We can try the resultant code on the first row of your data set:
#Using grepl() rather than grep() so that we return a list of logical values.
grepl(x= data[1,], pattern = "^[[:upper:]]+$|[[:punct:]]")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
This gives us our expected response except for column nine which has the value 1.24. Here the decimal point is being recognised as punctuation and is being flagged as a match.
We can add a "negative lookahead assertion" - (?!\\.) - to remove any periods from consideration, before they are even tested for being punctuation characters. Note we use \ to escape the period.
grepl(x= data[1,], perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
This returns a better response - it now no longer matches decimal places. NOTE: This might not be what you want as this pattern also won't match any fullstops in character fields. You would need to refine the pattern further.
Rather than use a 'for' loop to iterate this code across every row in your dataframe, I would use vectorization instead, which is 'more R like'.
To do this we must convert our script into a function which we will call with apply().
myFunction <- function(x){
  matches <- grepl(x = x, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
  #Given a set of logical vectors 'matches', is at least one of the values true? using any()
  return(any(matches))
}
apply(X = data, 1, myFunction)
The 1 above instructs apply() to iterate across rows rather than columns.
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your example data set all rows have an entry containing a special character or a string of all capital letters. This is unsurprising as many columns in your example data set are a list of single capital letters.
If you are just interested in which values in column thirteen fit the stated criteria you can use:
matches <- grepl(x= data$thir, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
matches
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
To subset your dataframe on matching rows:
data[matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
3 5 5 D D 5 D D D 5.33 3 1 1 T(&*(
4 1 1 E A 1 E A A 1.44 4 1 2 !!!
5 3 3 F B 3 F B B 3.11 5 1 1 #$#
6 5 5 G D 5 G D D 5.33 6 1 2 $Q%J
8 3 3 I B 3 I B B 3.66 8 1 2 THIS
To subset your dataframe on non-matching rows:
data[!matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
1 1 1 A A 1 A A A 1.24 1 1 1 THiS
2 3 3 B B 3 B B B 3.52 2 1 2 THAT34
7 1 1 H A 1 H A A 1.55 7 1 1 who
9 5 5 J D 5 J D D 5.33 9 1 1 this
10 1 1 H A 1 H A A 1.32 10 1 2 this
11 3 3 I B 3 I B B 3.54 11 1 1 this
12 5 5 J D 5 J D D 5.77 12 1 2 this
Note that the regular expression used doesn't match THAT34 as it isn't composed wholly of capitalised letters, having the number 34 at the end.
EDIT:
To get the column names that fulfill the criteria in your edit, use myFunction described above with:
colnames(data)[apply(X = data, 2, myFunction)]
"thr" "fou" "six" "sev" "eig" "thir"
The number in apply() changes from 1 to 2 to reiterate across columns rather than rows. We pass the output from apply(), a list of logical matches (TRUE or FALSE), to colnames(data) - this returns the matching column names via subsetting.
I would collapse the data into strings (one string per row)
strings = apply(data, 1, paste, collapse = "")
contains_only_caps = strings == toupper(strings)
strings[contains_only_caps]
# [1] "33BB3BBB3.52 212THAT34" "55DD5DDD5.33 311T(&*(" "11EA1EAA1.44 412!!!" "33FB3FBB3.11 511#$#"
# [5] "55GD5GDD5.33 612$Q%J" "33IB3IBB3.66 812THIS"
# escaping special characters
spec=c("\\$","%","\\&")
contains_spec = stringr::str_detect(strings, pattern = paste(spec, collapse = "|"))
strings[contains_spec]
# [1] "55DD5DDD5.33 311T(&*(" "33FB3FBB3.11 511#$#" "55GD5GDD5.33 612$Q%J"
You could also use which on contains_spec or contains_only_caps to get the corresponding row numbers for the original data frame. I think that using strings rather than row-wise data frame elements will be much faster - as long as you want to search the whole strings, not certain columns for certain conditions.

How to properly manipulate a string column in a data frame in R?

I have a data.frame with a string column that contains periods e.g "a.b.c.X". I want to split out the string by periods and retain the third segment e.g. "c" in the example given. Here is what I'm doing.
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a.b.a.X 1
2 a.b.b.X 2
3 a.b.c.X 3
And what I want is
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> df
v b
1 a 1
2 b 2
3 c 3
I'm attempting to use within, but I'm getting strange results. The value in the first row in the first column is being repeated.
> get = function(x) { unlist(strsplit(x, "\\."))[3] }
> within(df, v <- get(as.character(v)))
v b
1 a 1
2 a 2
3 a 3
What is the best practice for doing this? What am I doing wrong?
Update:
Here is the solution I used from #agstudy's answer:
> df = data.frame(v=c("a.b.a.X", "a.b.b.X", "a.b.c.X"), b=seq(1,3))
> get = function(x) gsub(".*?[.].*?[.](.*?)[.].*", '\\1', x)
> within(df, v <- get(v))
v b
1 a 1
2 b 2
3 c 3
Using a regular expression you can do:
gsub(".*?[.].*?[.](.*?)[.].*", '\\1', df$v)
[1] "a" "b" "c"
Or, more concisely:
gsub("(.*?[.]){2}(.*?)[.].*", '\\2', df$v)
The problem is not with within but with your get function. It returns a single character ("a") which gets recycled when added to your data.frame. Your code should look like this:
get.third <- function(x) sapply(strsplit(x, "\\."), `[[`, 3)
within(df, v <- get.third(as.character(v)))
Here is one possible solution:
df[, "v"] <- do.call(rbind, strsplit(as.character(df[, "v"]), "\\."))[, 3]
## > df
## v b
## 1 a 1
## 2 b 2
## 3 c 3
The answer to "what am I doing wrong" is that the bit of code that you thought was extracting the third element of each split string was actually putting all the elements of all your strings in a single vector, and then returning the third element of that:
get = function(x) {
  splits = strsplit(x, "\\.")
  print("All the elements: ")
  print(unlist(splits))
  print("The third element:")
  print(unlist(splits)[3])
  # What you actually wanted:
  third_chars = sapply(splits, function (x) x[3])
}
within(df, v2 <- get(as.character(v)))