An extension to: Removing list of words from a string
I have the following dataframe and I want to delete frequently occurring words from the data.name column:
data :
name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark
I'm creating a new dataframe of words and their frequencies with the following code:
import pandas as pd
# count every whitespace-separated word across the name column
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]
which results in
df :
word freq
Clinton 4
Bill 3
James 3
Clark 3
Then I'm converting it into a dictionary with the following code snippet:
d = dict(zip(df['word'], df['freq']))
Now, to remove the words in d (a dictionary of word : freq) from data.name, I'm using the following code snippet:
def check_thresh_word(merc,d):
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
    # no word of merc is in d
    return True

def rm_freq_occurences(merc,d):
    if check_thresh_word(merc,d) == False:
        nwords = merc.split(' ')
        # keep only the words that are not in d
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m=merc
    return m

data['new_name'] = data['name'].apply(lambda x: rm_freq_occurences(x,d))
But my actual dataframe (data) contains nearly 240k rows, and I have to use a threshold (thresh=3 in the sample above) greater than 100.
So the above code takes a long time to run because of the complex search.
Is there a more efficient way to do this?
The desired output is:
name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam
Thanks in advance!!!!!!!
Use replace with a regex built by joining all values of the word column with |, then strip the trailing whitespace:
data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
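One caveat with a plain alternation such as Clinton|James|Bill|Clark is that it also matches inside longer words (for example Bill inside Billy). Below is a minimal sketch of a stricter pattern, using only the standard re module rather than pandas. The helper name strip_words and the input "Billy Clinton" are made up for illustration and are not part of the original data.

```python
import re

# The frequent words from the example (the 'word' column of df).
words = ["Clinton", "Bill", "James", "Clark"]

# \b anchors each alternative at word boundaries; re.escape guards against
# words that happen to contain regex metacharacters.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")

def strip_words(name):
    """Remove the frequent words, then normalise the leftover whitespace."""
    return " ".join(pattern.sub("", name).split())

print(strip_words("Tom Bill"))       # Tom
print(strip_words("Billy Clinton"))  # Billy
```

Since replace(..., regex=True) takes a pattern string, pattern.pattern can presumably be passed to it in the same way as the joined string above.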
Another solution is to add \s* around each word so that zero or more surrounding whitespace characters are removed as well:
pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*
data.name = data.name.replace(pat, '', regex=True)
print (data)
name
0 Hayden
1 Rock
2 Gates
3 Vishal
4 Cameroon
5 Micky
6 Michael
7 Tony Waugh
8 Tom
9 Tom
10 Avinash
11 Shreyas
12 Ramesh
13 Adam
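For comparison, the counting and removal steps can also be done without any regex. The following is only a sketch in plain Python (no pandas), assuming names are split on whitespace; the function names frequent_words and remove_words are invented for this example. A set gives constant-time membership tests, which is the part that matters when the word list is long.

```python
from collections import Counter

names = [
    "Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
    "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh",
    "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
    "Ramesh Clinton", "Adam Clark",
]

def frequent_words(names, thresh):
    """Return the set of words that occur at least `thresh` times overall."""
    counts = Counter(word for name in names for word in name.split())
    return {word for word, n in counts.items() if n >= thresh}

def remove_words(name, words):
    """Drop every word of `name` that is in the set `words`."""
    return " ".join(w for w in name.split() if w not in words)

frequent = frequent_words(names, thresh=3)
cleaned = [remove_words(name, frequent) for name in names]
print(cleaned)
```

With a dataframe, remove_words could be passed to apply on the name column in the same way as rm_freq_occurences in the question.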
This is a rather tricky question indeed. It would be awesome if someone might be able to help me out.
What I'm trying to do is the following. I have a data frame in R containing every locality in a given state, scraped from Wikipedia. It looks something like this (top 10 rows). Let's call it NewHampshire.df:
Municipality County Population
1 Acworth Sullivan 891
2 Albany Carroll 735
3 Alexandria Grafton 1613
4 Allenstown Merrimack 4322
5 Alstead Cheshire 1937
6 Alton Belknap 5250
7 Amherst Hillsborough 11201
8 Andover Merrimack 2371
9 Antrim Hillsborough 2637
10 Ashland Grafton 2076
I've further compiled a new variable called grep_term, which combines the values from Municipality and County into a single pattern that functions as an or-statement, something like this:
Municipality County Population grep_term
1 Acworth Sullivan 891 "Acworth|Sullivan"
2 Albany Carroll 735 "Albany|Carroll"
and so on. Furthermore, I have another dataset, containing self-disclosed locations of 2000 Twitter users. I call it location.df, and it looks a bit like this:
[1] "London" "Orleans village VT USA" "The World"
[4] "D M V Towson " "Playa del Sol Solidaridad" "Beautiful Downtown Burbank"
[7] NA "US" "Gaithersburg Md"
[10] NA "California " "Indy"
[13] "Florida" "exsnaveen com" "Houston TX"
I want to do two things:
1: Run grepl over every observation in the location.df dataset, and save TRUE or FALSE into a new variable depending on whether the self-disclosed location matches any entry in the first dataset.
2: Save the number of matches for each row of the NewHampshire.df dataset to a new variable. I.e., if there are 4 matches for Acworth in the Twitter location dataset, observation 1 in NewHampshire.df should have the value 4 in the newly created "matches" variable.
What I've done so far: I've solved task 1, as follows:
for(i in 1:234){
    location.df$isRelevant <- sapply(location.df$location, function(s) grepl(NH_Places[i], s, ignore.case = TRUE))
}
How can I solve task 2, ideally in the same for loop?
Thanks in advance, any help would be greatly appreciated!
With regard to task one, you could also use:
# location vector to be matched against
loc.vec <- c("Acworth","Hillsborough","California","Amherst","Grafton","Ashland","London")
location.df <- data.frame(location=loc.vec)
# create a 'grep-vector'
places <- paste(paste(NewHampshire$Municipality, NewHampshire$County,
sep = "|"),
collapse = "|")
# match them against the available locations
location.df$isRelevant <- sapply(location.df$location,
function(s) grepl(places, s, ignore.case = TRUE))
which gives:
> location.df
location isRelevant
1 Acworth TRUE
2 Hillsborough TRUE
3 California FALSE
4 Amherst TRUE
5 Grafton TRUE
6 Ashland TRUE
7 London FALSE
To get the number of matches in the location.df with the grep_term column, you can use:
NewHampshire$n.matches <- sapply(NewHampshire$grep_term, function(x) sum(grepl(x, loc.vec)))
gives:
> NewHampshire
Municipality County Population grep_term n.matches
1 Acworth Sullivan 891 Acworth|Sullivan 1
2 Albany Carroll 735 Albany|Carroll 0
3 Alexandria Grafton 1613 Alexandria|Grafton 1
4 Allenstown Merrimack 4322 Allenstown|Merrimack 0
5 Alstead Cheshire 1937 Alstead|Cheshire 0
6 Alton Belknap 5250 Alton|Belknap 0
7 Amherst Hillsborough 11201 Amherst|Hillsborough 2
8 Andover Merrimack 2371 Andover|Merrimack 0
9 Antrim Hillsborough 2637 Antrim|Hillsborough 1
10 Ashland Grafton 2076 Ashland|Grafton 2
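The two tasks are mirror images of each other: task 1 asks, per location, whether any pattern matches, and task 2 asks, per pattern, how many locations match. A rough sketch of that idea in Python follows, using only the standard re module. The data is a shortened copy of the example above, None stands in for NA, and matching is case-insensitive here (the R call for n.matches above is case-sensitive), so treat those as assumptions.

```python
import re

places = [
    ("Acworth", "Sullivan"),
    ("Albany", "Carroll"),
    ("Alexandria", "Grafton"),
    ("Amherst", "Hillsborough"),
    ("Ashland", "Grafton"),
]
locations = ["Acworth", "Hillsborough", "California", "Amherst",
             "Grafton", "Ashland", "London", None]  # None plays the role of NA

# One "Municipality|County" pattern per row, like the grep_term column.
patterns = [re.compile("|".join(map(re.escape, p)), re.IGNORECASE) for p in places]

# Task 2: for each place, count the locations its pattern matches.
n_matches = [sum(1 for loc in locations if loc and pat.search(loc))
             for pat in patterns]

# Task 1: for each location, check whether any place pattern matches it.
is_relevant = [bool(loc) and any(pat.search(loc) for pat in patterns)
               for loc in locations]

print(n_matches)
print(is_relevant)
```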
I have a data frame with two columns. I want to extract the strings in parentheses and add them as a new column. The data frame is shown below.
structure(list(ID = 1:12, Gene.Name = structure(c(3L, 11L, 9L,
5L, 1L, 8L, 2L, 4L, 6L, 12L, 10L, 7L), .Label = c(" ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA",
" heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA", " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
" ribosomal protein L34 (RPL34), transcript variant 1, mRNA",
" ribosomal protein S11 (RPS11), mRNA", "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA",
"clone MGC:10120 IMAGE:3900723, mRNA, complete cds", "cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA",
"farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA", "homeobox protein from AL590526 (LOC84528), mRNA",
"mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA",
"ribosomal protein S15a (RPS15A), mRNA"), class = "factor")), .Names = c("ID",
"Gene.Name"), row.names = c(NA, -12L), class = "data.frame")
If no string in parentheses is found, leave that row empty. I have two cases here:
1) Get all the strings in parentheses and add them as a new column, separated by ,
2) Get only the last string in parentheses and add it as a new column
I tried something like df$Symbol <- sapply(df, function(x) sub("\\).*", "", sub(".*\\(", "", x))) but it does not give the appropriate output.
Case 1 output
ID Gene.Name Symbol
1 NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA ubiquinone, (9kD, MLRQ),NDUFA4
2 mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA MRPS33
3 farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA FDFT1
4 ribosomal protein S11 (RPS11), mRNA RPS11
5 ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA oligomycin sensitivity conferring protein,ATP5O
6 cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA CMAS
7 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA HNRPF
8 ribosomal protein L34 (RPL34), transcript variant 1, mRNA RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA subunit 9,ATP5G3
10 ribosomal protein S15a (RPS15A), mRNA RPS15A
11 homeobox protein from AL590526 (LOC84528), mRNA LOC84528
12 clone MGC:10120 IMAGE:3900723, mRNA, complete cds NA
Case 2 output
ID Gene.Name Symbol
1 NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA NDUFA4
2 mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA MRPS33
3 farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA FDFT1
4 ribosomal protein S11 (RPS11), mRNA RPS11
5 ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA ATP5O
6 cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA CMAS
7 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA HNRPF
8 ribosomal protein L34 (RPL34), transcript variant 1, mRNA RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA ATP5G3
10 ribosomal protein S15a (RPS15A), mRNA RPS15A
11 homeobox protein from AL590526 (LOC84528), mRNA LOC84528
12 clone MGC:10120 IMAGE:3900723, mRNA, complete cds <NA>
An option using sub to get the words inside the round brackets at the end of the string.
Symbol <- sub('.*\\(([^\\)]+)\\)[^\\(]+$', '\\1',df1[,2])
df1$Symbol <- Symbol[1:nrow(df1)*NA^(!grepl('\\(',df1[,2]))]
df1$Symbol
#[1] "NDUFA4" "MRPS33" "FDFT1" "RPS11" "ATP5O" "CMAS"
#[7] "HNRPF" "RPL34" "ATP5G3" "RPS15A" "LOC84528" NA
Update
For the first case, i.e. extracting all characters within the round brackets and pasting them together with a comma separator, one option is rm_round from qdapRegex. The output of rm_round is a list, so we use lapply/sapply to loop through it. Strings that contain a , are found with grep and wrapped in round brackets again, and then the strings are pasted together with collapse=', '. toString is a convenient wrapper for that last step.
library(qdapRegex)
df1$allSymbol <- sapply(rm_round(df1[,2],extract=TRUE), function(x) {
indx <- grep(',', x)
x[indx] <-paste0("(", x[indx], ")")
toString(x)})
is.na(df1$allSymbol) <- df1$allSymbol=='NA'
df1[3:4]
# allSymbol Symbol
#1 ubiquinone, (9kD, MLRQ), NDUFA4 NDUFA4
#2 MRPS33 MRPS33
#3 FDFT1 FDFT1
#4 RPS11 RPS11
#5 oligomycin sensitivity conferring protein, ATP5O ATP5O
#6 CMAS CMAS
#7 HNRPF HNRPF
#8 RPL34 RPL34
#9 subunit 9, ATP5G3 ATP5G3
#10 RPS15A RPS15A
#11 LOC84528 LOC84528
#12 <NA> <NA>
I think I took the easy way out, but if you can get away with it, only match the things in parentheses that look like a gene symbol, i.e. only capital letters and digits:
dd <- structure(list(ID = 1:12, Gene.Name = structure(c(3L, 11L, 9L, 5L, 1L, 8L, 2L, 4L, 6L, 12L, 10L, 7L), .Label = c(" ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA", " heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA", " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA", " ribosomal protein L34 (RPL34), transcript variant 1, mRNA", " ribosomal protein S11 (RPS11), mRNA", "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA", "clone MGC:10120 IMAGE:3900723, mRNA, complete cds", "cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA", "farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA", "homeobox protein from AL590526 (LOC84528), mRNA", "mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA", "ribosomal protein S15a (RPS15A), mRNA"), class = "factor")), .Names = c("ID", "Gene.Name"), row.names = c(NA, -12L), class = "data.frame")
dd$Gene.Name <- as.character(dd$Gene.Name)
## case 1
mm <- gregexpr('(?<=\\()(.*?)(?=\\))', dd$Gene.Name, perl = TRUE)
mm <- regmatches(dd$Gene.Name, mm)
dd <- cbind(dd, case1 = sapply(mm, function(x)
ifelse(length(x), paste(x, collapse = ', '), NA)))
dd[, c(1,3)]
# ID case1
# 1 1 ubiquinone, 9kD, MLRQ, NDUFA4
# 2 2 MRPS33
# 3 3 FDFT1
# 4 4 RPS11
# 5 5 oligomycin sensitivity conferring protein, ATP5O
# 6 6 CMAS
# 7 7 HNRPF
# 8 8 RPL34
# 9 9 subunit 9, ATP5G3
# 10 10 RPS15A
# 11 11 LOC84528
# 12 12 <NA>
## case 2
mm <- gregexpr('(?<=\\()([A-Z0-9]+)(?=\\))', dd$Gene.Name, perl = TRUE)
mm <- regmatches(dd$Gene.Name, mm)
dd <- cbind(dd, case2 = sapply(mm, function(x) ifelse(length(x), x, NA)))
dd[, c(1,4)]
# ID case2
# 1 1 NDUFA4
# 2 2 MRPS33
# 3 3 FDFT1
# 4 4 RPS11
# 5 5 ATP5O
# 6 6 CMAS
# 7 7 HNRPF
# 8 8 RPL34
# 9 9 ATP5G3
# 10 10 RPS15A
# 11 11 LOC84528
# 12 12 <NA>
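The same two cases can be sketched in Python with a single pattern that captures everything between a pair of round brackets. This is only an illustration using the standard re module; the function names all_symbols and last_symbol are invented here, and nested parentheses are assumed not to occur (as in the sample data).

```python
import re

# Capture the text between "(" and ")" as long as it contains no brackets.
PAREN = re.compile(r"\(([^()]*)\)")

def all_symbols(text):
    """Case 1: every bracketed string, joined by ', ' (None if there is none)."""
    found = PAREN.findall(text)
    return ", ".join(found) if found else None

def last_symbol(text):
    """Case 2: only the last bracketed string (None if there is none)."""
    found = PAREN.findall(text)
    return found[-1] if found else None

gene = "NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA"
print(all_symbols(gene))  # ubiquinone, 9kD, MLRQ, NDUFA4
print(last_symbol(gene))  # NDUFA4
```

Taking the last match avoids having to assume that symbols consist only of capital letters and digits.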
I have a column as follows in a dataframe called PeakBoundaries:
chrom
chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509
I would like to separate out the columns so that the columns look like below in a dataframe:
chr chrStart chrEnd
chr11 69464719 69502928
chr7 55075808 55093954
chr8 128739772 128762863
chr3 169389459 169490555
etc.
I have tried a regular expression approach but am not getting anywhere in terms of getting the match to enter into a new column:
PeakBoundaries$chrOnly <- PeakBoundaries[grep("\\w+?=\\:"),PeakBoundaries$chrom]
I am met with the error:
Error in [.data.frame(PeakBoundaries, grep("\w+?=\:"), PeakBoundaries$chrom) :
undefined columns selected
Try this - no regex needed, just the strsplit function:
dat <- read.table(text="chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509", stringsAsFactors=FALSE)
dat[,2:4] <- matrix(unlist(strsplit(dat[,1],split = "\\:|\\-")), ncol=3, byrow=TRUE)
colnames(dat) <- c("chrom", "chr", "chrStart", "chrEnd")
# Convert last two columns from character to numeric:
dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
Results
> dat
chrom chr chrStart chrEnd
1 chr11:69464719-69502928 chr11 69464719 69502928
2 chr7:55075808-55093954 chr7 55075808 55093954
3 chr8:128739772-128762863 chr8 128739772 128762863
4 chr3:169389459-169490555 chr3 169389459 169490555
5 chr17:37848534-37877201 chr17 37848534 37877201
6 chr19:30306758-30316875 chr19 30306758 30316875
7 chr1:150496857-150678056 chr1 150496857 150678056
8 chr12:69183279-69260755 chr12 69183279 69260755
9 chr11:77610143-77641464 chr11 77610143 77641464
10 chr8:38191804-38260814 chr8 38191804 38260814
11 chr12:58135797-58156509 chr12 58135797 58156509
Edit
You could do everything using only your existing dataframe. Replace dat[,1] with PeakBoundaries$chrom and dat[,2:4] with PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] and you should have it!
Edit By OP
OK, so I think there's something a bit odd with my dataset, but I've sorted it with Dominic's help so that it is now:
PeakBoundaries <- as.data.frame(PeakBoundaries)
PeakBoundaries <- PeakBoundaries[-1,]
PeakBoundaries <- as.data.frame(PeakBoundaries)
PeakBoundaries$PeakBoundaries <-
as.character(PeakBoundaries$PeakBoundaries)
PeakBoundaries[,(ncol(PeakBoundaries)+1):(ncol(PeakBoundaries)+3)] <-
matrix(unlist(strsplit(PeakBoundaries$PeakBoundaries,
split = "\\:|\\-")), ncol=3, byrow=TRUE)
A shorter version of Dominic's answer, making the insertion a one-liner (str_split comes from the stringr package):
library(stringr)
dat <- data.frame(chrom = readLines(textConnection("chr11:69464719-69502928
chr7:55075808-55093954
chr8:128739772-128762863
chr3:169389459-169490555
chr17:37848534-37877201
chr19:30306758-30316875
chr1:150496857-150678056
chr12:69183279-69260755
chr11:77610143-77641464
chr8:38191804-38260814
chr12:58135797-58156509")) )
dat[, c('chr','chrStart','chrEnd')] <- t( sapply( dat$chrom, function(s) { str_split(s, '[:-]') [[1]] } ) )
dat$chrStart <- as.numeric(dat$chrStart)
dat$chrEnd <- as.numeric(dat$chrEnd)
We could try
library(tidyr)
extract(dat, chrom, into=c('chr', 'chrStart', 'chrEnd'),
'([^:]+):([^-]+)-(.*)', convert=TRUE)
# chr chrStart chrEnd
#1 chr11 69464719 69502928
#2 chr7 55075808 55093954
#3 chr8 128739772 128762863
#4 chr3 169389459 169490555
#5 chr17 37848534 37877201
#6 chr19 30306758 30316875
#7 chr1 150496857 150678056
#8 chr12 69183279 69260755
#9 chr11 77610143 77641464
#10 chr8 38191804 38260814
#11 chr12 58135797 58156509
Or a faster option using tstrsplit from data.table (v1.9.5 or later, the development version at the time of writing):
library(data.table) # v1.9.5+
nm1 <- c('chr', 'chrStart', 'chrEnd')
res <- setDT(tstrsplit(dat$chrom, '[:-]', type.convert=TRUE))
setnames(res, nm1)
res
# chr chrStart chrEnd
# 1: chr11 69464719 69502928
# 2: chr7 55075808 55093954
# 3: chr8 128739772 128762863
# 4: chr3 169389459 169490555
# 5: chr17 37848534 37877201
# 6: chr19 30306758 30316875
# 7: chr1 150496857 150678056
# 8: chr12 69183279 69260755
# 9: chr11 77610143 77641464
#10: chr8 38191804 38260814
#11: chr12 58135797 58156509
Or
library(splitstackshape)
setnames(cSplit(dat, 'chrom', ':|-',fixed=FALSE,
type.convert=TRUE), nm1)[]
data
dat <- structure(list(chrom = structure(c(2L, 9L, 10L, 8L, 6L, 7L, 1L,
5L, 3L, 11L, 4L), .Label = c("chr1:150496857-150678056",
"chr11:69464719-69502928",
"chr11:77610143-77641464", "chr12:58135797-58156509",
"chr12:69183279-69260755",
"chr17:37848534-37877201", "chr19:30306758-30316875",
"chr3:169389459-169490555",
"chr7:55075808-55093954", "chr8:128739772-128762863",
"chr8:38191804-38260814"
), class = "factor")), .Names = "chrom", row.names = c(NA, -11L
), class = "data.frame")
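All of the approaches above split on the same two delimiters, : and -. For readers coming from Python, here is a minimal sketch of the same split with the standard re module. The function name parse_region is hypothetical, and the pattern assumes that both coordinates are plain non-negative integers.

```python
import re

# chromosome name, then ":" start "-" end
REGION = re.compile(r"([^:]+):(\d+)-(\d+)")

def parse_region(region):
    """Split 'chr11:69464719-69502928' into (chr, chrStart, chrEnd)."""
    m = REGION.fullmatch(region.strip())
    if m is None:
        raise ValueError("not a chr:start-end string: %r" % region)
    chrom, start, end = m.groups()
    return chrom, int(start), int(end)

regions = ["chr11:69464719-69502928", "chr7:55075808-55093954"]
rows = [parse_region(r) for r in regions]
print(rows)
```

Raising on a non-matching row makes malformed input visible instead of silently shifting values into the wrong column.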
Dear R user community,
I have many data.frames in a list, as follows (only one data.frame in the list of 21 shown for convenience):
> str(datal)
List of 21
$ BallitoRaw.DAT :'data.frame': 1083 obs. of 3 variables:
..$ Filename: Factor w/ 21 levels "BallitoRaw.DAT",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ date :Class 'Date' num [1:1083] 7318 7319 7320 7321 7322 ...
..$ temp : num [1:1083] NA 25.8 NA NA NA NA NA NA NA 24.4 ...
If I work on each data.frame in the list individually I can create a zoo object from temp and date, as such:
> BallitoRaw.zoo <- zoo(datal$BallitoRaw.DAT$temp, datal$BallitoRaw.DAT$date)
The zoo object looks like this:
> head(BallitoRaw.zoo)
1990-01-14 1990-01-15 1990-01-16 1990-01-17 1990-01-18 1990-01-19
NA 25.8 NA NA NA NA
How do I use llply or apply (or similar) to work on the whole list at once?
The output needs to go into a new list of data.frames, or a series of independent data.frames (each one named as in the zoo example above). Note that the date column, although a regular time series (days), contains missing dates (in addition to NAs for temps of existing dates); the missing dates will be filled by the zoo function. The output data.frame with the zoo object will thus be longer than the original one.
Help kindly appreciated.
library(zoo)
makeNamedZoo <- function(dfrm) {
    dfrmname <- deparse(substitute(dfrm))
    zooname <- dfrmname
    assign(zooname, zoo(dfrm$temp, dfrm$date))
    return(get(zooname))
}
ListOfZoos <- lapply(dflist, makeNamedZoo)
names(ListOfZoos) <- paste( sub("DAT$", "", names(dflist) ), "zoo", sep="")
Here is a simple test case:
df1 <- data.frame(a= letters[1:10], date=as.Date("2011-01-01")+0:9, temp=rnorm(10) )
df2 <- data.frame(a= letters[1:10], date=as.Date("2011-01-01")+0:9, temp=rnorm(10) )
dflist <- list(dfone.DAT=df1,dftwo.DAT=df2)
ListOfZoos <- lapply(dflist, makeNamedZoo)
names(ListOfZoos) <- paste( sub("DAT$", "", names(dflist) ), "zoo", sep="")
$dfone.zoo
2011-01-01 2011-01-02 2011-01-03 2011-01-04 2011-01-05 2011-01-06 2011-01-07
0.7869056 1.6523928 -1.1131432 1.2261783 1.1843587 0.2673762 -0.4159968
2011-01-08 2011-01-09 2011-01-10
-1.2686391 -0.4135859 -1.4916291
$dftwo.zoo
2011-01-01 2011-01-02 2011-01-03 2011-01-04 2011-01-05 2011-01-06 2011-01-07
0.7356612 -0.1263861 -1.6901240 -0.6441732 -1.4675871 2.3006544 1.0263354
2011-01-08 2011-01-09 2011-01-10
-0.8577544 0.6079986 0.6625564
This is an easier way to achieve what I needed:
tozoo <- function(x) zoo(x$temp, x$date)
data1.zoo <- do.call(merge, lapply(split(data1, data1$Filename), tozoo))
The result is a nice zoo object.
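The pattern behind both answers is the same: apply one conversion function to every element of a named list and rename the results. As a loose analogy only (Python has no zoo object), here is a sketch with plain dictionaries standing in for the data frames and the series. The helper name to_series and the toy data are assumptions made for this example.

```python
import re
from datetime import date, timedelta

def to_series(rows):
    """Turn a list of (date, temp) rows into a date-ordered {date: temp} mapping."""
    return dict(sorted(rows, key=lambda r: r[0]))

start = date(2011, 1, 1)
tables = {
    "dfone.DAT": [(start + timedelta(days=i), float(i)) for i in range(3)],
    "dftwo.DAT": [(start + timedelta(days=i), None) for i in range(3)],
}

# Apply the conversion to every table and swap the DAT suffix for zoo,
# mirroring the lapply call followed by the names<- assignment.
series = {re.sub(r"DAT$", "zoo", name): to_series(rows)
          for name, rows in tables.items()}
print(list(series))
```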