split string for specific column - regex

I have a file like this:
V1 V2
1 1-500891 CGCGACCTCAGATCAGACGTGGCGACCCGCTGAA
2 2-280976 AGGTTCCGGATAAGTAAGAGCC
3 3-223181 TCTTAACCCGGACCAGAAACTA
I would like to split (and swap) the V1 column resulting in the following output
Sequence Count
CGCGACCTCAGATCAGACGTGGCGACCCGCTGAA 500891
AGGTTCCGGATAAGTAAGAGCC 280976
TCTTAACCCGGACCAGAAACTA 223181
I have tried this, but it did not work:
df_split <- strsplit(as.character(df), split="-", fixed=T)

You can use sub to remove everything up to and including the -.
df$V1 <- sub('.*-', '', df$V1)
df
# V1 V2
#1 500891 CGCGACCTCAGATCAGACGTGGCGACCCGCTGAA
#2 280976 AGGTTCCGGATAAGTAAGAGCC
#3 223181 TCTTAACCCGGACCAGAAACTA
You applied strsplit to the whole dataset instead of a specific column ("V1"). Here is a possible option to consider:
df$V1 <- sapply(strsplit(as.character(df$V1),
split="-", fixed=TRUE),`[`,2)
df$V1
#[1] "500891" "280976" "223181"
Or an option using tidyr
library(tidyr)
extract(df, 'V1', 'Count', '.*-(.*)')
# Count V2
#1 500891 CGCGACCTCAGATCAGACGTGGCGACCCGCTGAA
#2 280976 AGGTTCCGGATAAGTAAGAGCC
#3 223181 TCTTAACCCGGACCAGAAACTA
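The snippets above clean up V1 but keep the original column order. As a minimal base R sketch, assuming the data frame has exactly the two columns V1 and V2 shown in the question, the requested Sequence/Count layout could be built like this (the name result is only illustrative):

```r
# Sample data mirroring the question; V1 and V2 are the question's column names.
df <- data.frame(
  V1 = c("1-500891", "2-280976", "3-223181"),
  V2 = c("CGCGACCTCAGATCAGACGTGGCGACCCGCTGAA",
         "AGGTTCCGGATAAGTAAGAGCC",
         "TCTTAACCCGGACCAGAAACTA"),
  stringsAsFactors = FALSE
)

# Drop everything up to and including the last "-", convert the rest to
# integer, and put the sequence column first.
result <- data.frame(
  Sequence = df$V2,
  Count    = as.integer(sub(".*-", "", df$V1)),
  stringsAsFactors = FALSE
)
result
```

Converting Count to integer is a choice made for this sketch; drop as.integer if the counts should stay as text.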


Transform '1-1-1 through 1-10-1' to ten values of '1-1-1', '1-2-1',... '1-10-1'

I have a dataframe like below:
ID
1-1-1, 1-2-1
2-1-1
3-1-1 through 3-5-1
I am looking to transform the dataframe to
ID
1-1-1
1-2-1
2-1-1
3-1-1
3-2-1
3-3-1
3-4-1
3-5-1
For the first row of the first dataframe, I think melt can do the job. For the third row, though, I need to somehow replace 'through' with the IDs in between. I tried some regular expressions but did not find a good way to do so.
Follow-up question:
What if there is another column and I want to keep it matched to the IDs?
NewColumn ID
A 1-1-1, 1-2-1
B 2-1-1
C 3-1-1 through 3-5-1
to
NewColumn ID
A 1-1-1
A 1-2-1
B 2-1-1
C 3-1-1
C 3-2-1
C 3-3-1
C 3-4-1
C 3-5-1
The first digit in ID could be the same for multiple NewColumn items.
We could do this using cSplit from splitstackshape and data.table approaches after we replace the through with , using sub.
Using sub with a regex, we match zero or more spaces (\\s*), followed by through, followed by zero or more spaces (\\s*), and replace the match with , in the 'ID' column.
df1$ID <- sub('\\s*through\\s*', ', ', df1$ID)
Now we use cSplit to split the 'ID' column on the delimiter ', ' with the direction set to 'long'. The output is still non-numeric, so to build a sequence we split it further into numeric parts: a second cSplit with - as the delimiter and the default 'wide' direction gives three columns. From here we can use data.table syntax. We group by the 'ID_1' and 'ID_3' columns and check whether the number of elements in the group (.N) is greater than 1. If it is, we take the sequence from the first to the second element of the 'ID_2' column (each such group has only two elements here). Finally, we paste the three columns together and create a 'data.frame'.
library(splitstackshape)
library(data.table)
ID <- cSplit(cSplit(df1, 'ID', ', ', 'long'), 'ID', '-', type.convert=TRUE)[,
list(ID_2=if(.N>1) ID_2[1]:ID_2[2] else ID_2), by = .(ID_1, ID_3)
][, paste(ID_1, ID_2, ID_3, sep="-")]
d1 <- data.frame(ID, stringsAsFactors=FALSE)
d1
#ID
#1 1-1-1
#2 1-2-1
#3 2-1-1
#4 3-1-1
#5 3-2-1
#6 3-3-1
#7 3-4-1
#8 3-5-1
For easier understanding, the code can be split into chunks. We split based on the ', ' to create a 'long' format
cLong <- cSplit(df1, 'ID', ', ', 'long')
In the next step, it is split on '-' and we use the option type.convert=TRUE to convert the columns to their respective classes.
cLong1 <- cSplit(cLong, 'ID', '-', type.convert=TRUE)
Now, we use data.table approach as the output from cSplit is already a 'data.table'
DT1 <- cLong1[, list(ID_2=if(.N>1)
ID_2[1]:ID_2[2]
else ID_2),
by = .(ID_1, ID_3)]
We paste the columns together
ID <- do.call(paste, c(DT1[,c(1,3,2), with=FALSE], sep='-'))
and create a 'data.frame'
data.frame(ID)
Update
For the follow-up question, we only need to change the grouping step: we add 'NewColumn' as a grouping variable.
df1$ID <- sub('\\s*through\\s*', ', ', df1$ID)
cSplit(cSplit(df1, 'ID', ', ', 'long'), 'ID', '-',
type.convert=TRUE)[, list(ID_2=if(.N>1) ID_2[1]:ID_2[2] else ID_2),
by = .(NewColumn, ID_1, ID_3)
][,list(ID=paste(ID_1, ID_2, ID_3, sep="-")) ,.(NewColumn)]
# NewColumn ID
#1: A 1-1-1
#2: A 1-2-1
#3: B 2-1-1
#4: C 3-1-1
#5: C 3-2-1
#6: C 3-3-1
#7: C 3-4-1
#8: C 3-5-1
data
df1 <- structure(list(ID = c("1-1-1, 1-2-1", "2-1-1",
"3-1-1 through 3-5-1")), .Names = "ID", class = "data.frame",
row.names = c(NA, -3L))
#newdata
df1 <- structure(list(NewColumn = c("A", "B", "C"),
ID = c("1-1-1, 1-2-1",
"2-1-1", "3-1-1 through 3-5-1")), .Names = c("NewColumn", "ID"
), class = "data.frame", row.names = c(NA, -3L))
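One thing to keep in mind with the approach above: because through is first rewritten to a comma, a comma-separated pair such as "1-1-1, 1-4-1" would also be expanded into a full range. As a hedged alternative, here is a base R sketch that treats the two cases separately. The helper expand_ids is hypothetical (not part of any package), and it assumes only the middle component changes across a range.

```r
# Hypothetical helper: expand one ID cell into a character vector of IDs.
expand_ids <- function(cell) {
  if (grepl("through", cell, fixed = TRUE)) {
    # Split "3-1-1 through 3-5-1" into its two endpoints.
    ends <- strsplit(cell, "\\s*through\\s*")[[1]]
    # Break each endpoint into its three integer parts.
    parts <- lapply(strsplit(ends, "-", fixed = TRUE), as.integer)
    from <- parts[[1]]
    to <- parts[[2]]
    # Assumption: only the middle component varies across the range.
    paste(from[1], seq(from[2], to[2]), from[3], sep = "-")
  } else {
    # Comma-separated list of explicit IDs: keep them as they are.
    strsplit(cell, ",\\s*")[[1]]
  }
}

df1 <- data.frame(
  NewColumn = c("A", "B", "C"),
  ID = c("1-1-1, 1-2-1", "2-1-1", "3-1-1 through 3-5-1"),
  stringsAsFactors = FALSE
)

# Expand every cell, then repeat NewColumn once per expanded ID.
ids <- lapply(df1$ID, expand_ids)
long <- data.frame(
  NewColumn = rep(df1$NewColumn, lengths(ids)),
  ID = unlist(ids),
  stringsAsFactors = FALSE
)
long
```

Repeating NewColumn with lengths(ids) keeps each label attached to its own IDs, so it does not matter if several labels share the same first digit.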

Subset all 3 digit numbers and collapse them with a separator in a data frame. R

I'm formatting a data set so each entry has the adegenet format for codominant markers, such as:
Loci1
###/###
208/210
200/204
198/208
where # represents any digit (the number is an allele size in base pairs). My data has some homozygous entries (all 3 digit integers with no separator) that have the form:
Loci1
###
208
198
I intend to paste the 3 digit string to itself with sep='/' to produce the first format. I've tried to use grep to subset these homozygous entries by finding all entries that match ###/### and negating the match, such as:
a <- grep('\\b\\d{3}?[/]\\d{3}', score$Loci1, value =T ) # Subset all ###/###/
score[!(a %in% 1:nrow(score$Loci1)), ] # works but only on vectors...
After the subset I could paste. The problem arises when I apply this to a data frame: grep seems to treat the data frame as a list (which in part it is) and returns the columns that have a match.
So, in short, how can I go from ### to ###/### in a data frame?
Self-contained example of data:
score2 <- NULL
set.seed(9)
Loci1 <- NULL
Loci2 <- NULL
Loci3 <- NULL
for (i in 1:5) Loci1 <- append(Loci1, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci2 <- append(Loci2, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
for (i in 1:5) Loci3 <- append(Loci3, paste(sample(seq(from = 230, to=330, by=3), 2, replace = F), collapse = '/'))
score2 <- data.frame(Loci1, Loci2, Loci3, stringsAsFactors = F)
score2[2,3] <- strsplit(score2[2,3], split = '/')[1]
score2[5,2] <- strsplit(score2[3,3], split = '/')[1]
score2[1,1] <- strsplit(score2[1,1], split = '/')[1]
score2[c(1, 4),c(2,3)] <- NA
score2
You could just replace the 3 digit items with the separator and a copy:
sub("^(...)$", "\\1/\\1", Loci1)
Use lapply with an anonymous function:
data.frame( lapply(score2, function(x) sub("^(...)$", "\\1/\\1", x) ) )
Loci1 Loci2 Loci3
1 251/251 <NA> <NA>
2 251/329 320/257 260/260
3 275/242 278/329 281/320
4 269/266 <NA> <NA>
5 296/326 281/281 326/314
(Not sure what the "paste-part" was supposed to refer to, but I think this was the intent of your question)
If the numeric values could have a varying number of digits then use a pattern argument like "^([0-9]{1,9})$"
An option using grep/paste,
m1 <- as.matrix(score2)
indx <- grep('^...$', m1)
m1[indx] <- paste(m1[indx], m1[indx], sep="/")
as.data.frame(m1)
# Loci1 Loci2 Loci3
#1 251/251 <NA> <NA>
#2 251/329 320/257 260/260
#3 275/242 278/329 281/320
#4 269/266 <NA> <NA>
#5 296/326 281/281 326/314
Or without converting to matrix, this can be done using lapply
score2[] <- lapply(score2, function(x) ifelse(grepl('^...$', x),
paste(x, x, sep="/"),x))
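To see the backreference idea in isolation, here is a small sketch on made-up values (the data frame score below is illustrative only, not the asker's data). It uses a digits-only pattern rather than "...", on the assumption that homozygous cells contain nothing but digits:

```r
# Illustrative data: a mix of homozygous cells, "###/###" cells and NA.
score <- data.frame(
  Loci1 = c("251", "251/329", NA),
  Loci2 = c("320/257", "281", "98"),
  stringsAsFactors = FALSE
)

# Double any cell made up solely of digits; cells that already contain "/"
# do not match the pattern, and NA stays NA.
score[] <- lapply(score, function(x) sub("^([0-9]+)$", "\\1/\\1", x))
score
```

Assigning into score[] keeps the object a data frame with its original column names, instead of rebuilding it with data.frame().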

How to very efficiently extract specific pattern from characters?

I have big data like this :
> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
What I want to do is, for every row, select the name after mature= and the name after Gene=, and then paste them together with
paste(a,b, sep="-")
for example, the expected output from first two rows would be like :
hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
so, the final implementation is like this:
for(i in 1:nrow(Data)){
Data[i,3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i,1])
Name <- strsplit(as.vector(Data[i,2]),"\\|")[[1]][2]
Data[i,4] <- as.numeric(sub("pvalue=","",Name))
print(i)
}
This works well, but it's very slow: Data has 200,000,000 rows. How can I speed it up?
If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the brackets below) everything from the equals sign up to the pipe symbol, and from Gene= to the end, and paste them together with a hyphen:
sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
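The key point is that sub is vectorised over its input, so the row-by-row loop in the question is not needed. A minimal sketch on two of the sample rows (the anchors ^ and $ are an addition in this sketch, assuming each string contains exactly one record):

```r
x <- c(
  "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
  "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
)

# One call processes the whole vector; \\1 is the mature name, \\2 the gene.
out <- sub("^mature=([^|]*).*Gene=(.*)$", "\\1-\\2", x)
out
# [1] "hsa-miR-5087-OR4F5"     "hsa-miR-26a-1-3p-OR4F9"
```

The same reasoning applies to the second column in the question's loop, since as.numeric and sub both accept whole vectors.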
Another option is to use read.table with = as the separator and then paste the two relevant columns (txt is assumed to hold the input lines as a character vector):
res = read.table(text=txt,sep='=')
paste(sub('[|].*','',res$V2), ## drop everything from the first | onwards
sub('^ +| +$','',res$V4),sep='-') ## trim a leading or trailing run of spaces
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5"
[5] "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5" "hsa-miR-650-OR4F5"
The simple sub solution already given looks quite nice but just in case here are some other approaches:
1) read.pattern Using read.pattern in the gsubfn package we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question:
library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")
giving:
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"
[4] "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
[7] "hsa-miR-650-OR4F5"
The intermediate data frame, DF, that was produced looks like this:
> DF
V1 V2 V3 V4 V5 V6
1 mature hsa-miR-5087 mir_Family - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family mir-26 Gene OR4F9
3 mature hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature hsa-miR-659-3p mir_Family - Gene OR4F5
5 mature hsa-miR-5197-3p mir_Family - Gene OR4F5
6 mature hsa-miR-5093 mir_Family - Gene OR4F5
7 mature hsa-miR-650 mir_Family mir-650 Gene OR4F5
The regular expression used here is:
(\w+)=([^|]*)
1a) names We could make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement:
DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))
paste(DF$mature, DF$Gene, sep = "-") # same answer as above
The DF produced in this section looks like this. It has 3 columns instead of 6, because the field names are read separately and used as the column names:
> DF
mature mir_Family Gene
1 hsa-miR-5087 - OR4F5
2 hsa-miR-26a-1-3p mir-26 OR4F9
3 hsa-miR-448 mir-448 OR4F5
4 hsa-miR-659-3p - OR4F5
5 hsa-miR-5197-3p - OR4F5
6 hsa-miR-5093 - OR4F5
7 hsa-miR-650 mir-650 OR4F5
2) strapplyc
Another approach using the same package. This extracts the fields coming after a = and not containing a | producing a list. We then sapply over that list pasting the first and third fields together:
sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))
giving the same result.
The regular expression used here is:
=([^|]*)
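If installing gsubfn is not an option, roughly the same extraction can be sketched with base R's regmatches and gregexpr. This is an alternative technique, not what strapplyc does internally, and it assumes every row has the three fields in the order mature, mir_Family, Gene:

```r
x <- c(
  "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5",
  "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
)

# For each string, pull out every "=value" piece (value runs up to the next "|").
fields <- regmatches(x, gregexpr("=[^|]*", x))

# Strip the leading "=" and join the first and third fields with "-".
out <- vapply(
  fields,
  function(f) paste(sub("^=", "", f[c(1, 3)]), collapse = "-"),
  character(1)
)
out
```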
Here is one approach:
Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
# [7] "hsa-miR-650-OR4F5"
Maybe not the most elegant, but you can try:
sapply(Data[,1],function(x){
parts<-strsplit(x,"\\|")[[1]]
y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
return(y)
})
Example
Data<-data.frame(col1=c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5","mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),col2=1:2,stringsAsFactors=F)
> Data[,1]
[1] "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5" "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
> sapply(Data[,1],function(x){
+ parts<-strsplit(x,"\\|")[[1]]
+ y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
+ return(y)
+ })
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5 mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
"hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9"

converting some rows into columns in R

I have a table with one column, and I want to move every second row into a new column next to the row before it.
Let's say my table is df:
V1
elements-of-01-to-20
ACTCTGCGACHCHAHAATT
elements-of-21-to-30
ACTAGCTATTATCGATATT
elements-of-31-to-40
CCCTTATATTGGAGCTACT
my desired result:
V1 V2
elements-of-01-to-20 ACTCTGCGACHCHAHAATT
elements-of-21-to-30 ACTAGCTATTATCGATATT
elements-of-31-to-40 CCCTTATATTGGAGCTACT
Edit:
Thanks for all the replies. My second question: what if my dataset has multiple sequence rows following each elements-of row?
V1
elements-of-01-to-20
ACTCTGCGACHCHAHAATT
AGGGGATGCTGATTTAGTA
elements-of-21-to-30
ACTAGCTATTATCGATATT
result:
V1 V2
elements-of-01-to-20 ACTCTGCGACHCHAHAATTAGGGGATGCTGATTTAGTA
elements-of-21-to-30 ACTAGCTATTATCGATATT
If the pattern is the same as in the example
indx <- c(TRUE, FALSE)
data.frame(V1=df$V1[indx], V2=df$V1[!indx])
# V1 V2
#1 elements-of-01-to-20 ACTCTGCGACHCHAHAATT
#2 elements-of-21-to-30 ACTAGCTATTATCGATATT
#3 elements-of-31-to-40 CCCTTATATTGGAGCTACT
Update
Based on the updated dataset
library(data.table)
setDT(df)[,list(V1=V1[1], V2=paste(V1[-1], collapse='')),
by= list(indx=cumsum(grepl('^[^A-Z]', df$V1)))][, indx:=NULL][]
# V1 V2
#1: elements-of-01-to-20 ACTCTGCGACHCHAHAATTAGGGGATGCTGATTTAGTA
#2: elements-of-21-to-30 ACTAGCTATTATCGATATT
New data
df <- structure(list(V1 = c("elements-of-01-to-20", "ACTCTGCGACHCHAHAATT",
"AGGGGATGCTGATTTAGTA", "elements-of-21-to-30", "ACTAGCTATTATCGATATT"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -5L))
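The grouping trick in the data.table answer (cumsum over a logical "is this a header row" vector) also works in base R. A minimal sketch, assuming every header starts with elements-of and is followed by at least one sequence row; the names is_header, record and result are illustrative:

```r
df <- data.frame(
  V1 = c("elements-of-01-to-20", "ACTCTGCGACHCHAHAATT", "AGGGGATGCTGATTTAGTA",
         "elements-of-21-to-30", "ACTAGCTATTATCGATATT"),
  stringsAsFactors = FALSE
)

# TRUE on header rows; cumsum() turns that into a running record number,
# so each sequence row gets the number of the header above it.
is_header <- grepl("^elements-of", df$V1)
record <- cumsum(is_header)

# Split the sequence rows by record and concatenate each group in order.
seqs <- split(df$V1[!is_header], record[!is_header])
result <- data.frame(
  V1 = df$V1[is_header],
  V2 = unname(vapply(seqs, paste, character(1), collapse = "")),
  stringsAsFactors = FALSE
)
result
```

A header with no sequence rows after it would break the one-to-one pairing in this sketch, so that case would need separate handling.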
If that is just a FASTA file, then look at the Biostrings package. You could also do it this way:
MySeq <- data.frame("Name" = df$V1[seq(1, length(df$V1), by=2)],
"Seq" = df$V1[seq(2, length(df$V1), by=2)],
stringsAsFactors = FALSE)
Here is another way using grepl:
#dummy data
df <- read.table(text=" V1
elements-of-01-to-20
ACTCTGCGACHCHAHAATT
elements-of-21-to-30
ACTAGCTATTATCGATATT
elements-of-31-to-40
CCCTTATATTGGAGCTACT",
as.is=TRUE,header=TRUE)
#result
cbind(df[ grepl("elements",df$V1), "V1"],
df[ !grepl("elements",df$V1), "V1"])
#output
# [,1] [,2]
# [1,] "elements-of-01-to-20" "ACTCTGCGACHCHAHAATT"
# [2,] "elements-of-21-to-30" "ACTAGCTATTATCGATATT"
# [3,] "elements-of-31-to-40" "CCCTTATATTGGAGCTACT"
Try (using traditional programming methods):
ndf = data.frame(V1="", V2="", stringsAsFactors=FALSE)
i=1
while(i<nrow(df)){
ndf[(nrow(ndf)+1),]=c(df[i,1], df[(i+1),1])
i=i+2
}
ndf[-1,]
V1 V2
2 elements-of-01-to-20 ACTCTGCGACHCHAHAATT
3 elements-of-21-to-30 ACTAGCTATTATCGATATT
4 elements-of-31-to-40 CCCTTATATTGGAGCTACT

How to add column to data.table with values from list based on regex

I have the following data.table:
id fShort
1 432-12 1245
2 3242-12 453543
3 324-32 45543
4 322-34 45343
5 2324-34 13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that holds the complete filename from the list. For this, the values of column "id" need to be matched against the filename list: if a filename starts with the "id" string, the complete filename should be returned. I use the following grep call
t <- grep("432-12","432-124343.png",value=T)
which returns the correct filename.
This is how the final table should look:
id fShort fComplete
1 432-12 1245 432-124343.png
2 3242-12 453543 3242-124342345.png
3 324-32 45543 NA
4 322-34 45343 NA
5 2324-34 13543 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[ , fComplete := lapply(id, function(x) {
m <- grep(x, filenames, value = TRUE)
if (!length(m)) NA else m})]
id fShort fComplete
1: 432-12 1245 432-124343.png
2: 3242-12 453543 3242-124342345.png
3: 324-32 45543 NA
4: 322-34 45343 NA
5: 2324-34 13543 NA
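Note that grep treats the id as a regular expression and matches it anywhere in the filename, and that lapply produces a list column. As a hedged sketch of a stricter variant, base R's startsWith can test for a literal prefix, and vapply returns a plain character vector. The vectors ids and filenames below simply restate the question's data:

```r
ids <- c("432-12", "3242-12", "324-32", "322-34", "2324-34")
filenames <- c("3242-124342345.png", "432-124343.png", "135-13434.jpeg")

# For each id, take the first filename that literally starts with it,
# or NA when no filename has that prefix.
fComplete <- vapply(ids, function(id) {
  hit <- filenames[startsWith(filenames, id)]
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1), USE.NAMES = FALSE)
fComplete
```

With a data.table, the resulting vector could then be attached with DT[, fComplete := fComplete]. Taking only the first hit is an assumption of this sketch; if several files can share a prefix, a different rule is needed.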
In my experience with similar functions, the regex functions sometimes return a list, so you have to account for that in the apply call. I usually work through one example manually first.
Also, in my experience apply on its own will not always return something that fits into a data.frame; sometimes I had to use lapply and/or unlist and data.frame to reshape it.
Here is an answer. I am not familiar with data.tables and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by seeing what apply was outputting and adding the [1] to get the piece I needed:
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$fComplete <- filenames1[x]