How can I select columns of a data.table based on a regex?
Consider a simple example as follows:
library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))
Is there a way to select the bar and baz columns from the data.table based on a regex? I know that the following works, but if the table were much bigger and I wanted to choose more variables, this could easily get cumbersome.
mydt[, .(bar, baz)]
I would like something like matches() from dplyr::select(), but working by reference.
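For illustration, the dplyr approach I have in mind (which copies rather than works by reference) would be something like:
library(dplyr)
select(mydt, matches("bar|baz"))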
You can also use %like% from the data.table package, which is a "convenience function for calling regexpr". It also makes the code more readable ;)
In this case, answering your question:
mydt[, .SD, .SDcols = names(mydt) %like% "bar|baz"]
As %like% returns a logical vector, we can use the following to get every column except those containing "foo":
mydt[, .SD, .SDcols = ! names(mydt) %like% "foo"]
where ! negates the logical vector.
David's answer will work. But if your regex is long and you would rather it be done first, try:
cols <- grep("<regex pattern>", names(mydt), value=T)
mydt[, cols, with=FALSE]
It just depends on your preferences and needs. You can also assign the subsetted table to a chosen variable if you need the original intact.
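For example, keeping the original intact (mydt_sub is just an illustrative name):
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt_sub <- mydt[, cols, with = FALSE]  # mydt itself is left unchanged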
UPDATE: I updated the comparison with @sindri_baldur's answer, using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters one should stick with the ..cols or with = FALSE solution (see below).
Apparently, there is a new way of achieving this from version 1.10.2 onwards.
library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]
It seems to work the fastest out of the posted solutions.
# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]
# Methods
subsetting <- function(dt) {
  subset(dt, select = grep("bar|baz", names(dt)))
}
usingSD <- function(dt) {
  dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}
usingWith <- function(dt) {
  cols <- grep("bar|baz", names(dt), value = TRUE)
  dt[, cols, with = FALSE]
}
usingDotDot <- function(dt) {
  cols <- grep("bar|baz", names(dt), value = TRUE)
  dt[, ..cols]
}
usingPatterns <- function(dt) {
  dt[, .SD, .SDcols = patterns("bar|baz")]
}
# Benchmark
library(microbenchmark)
microbenchmark(
  subsetting(big_dt), usingSD(big_dt), usingWith(big_dt), usingDotDot(big_dt), usingPatterns(big_dt),
  times = 5000
)
#Unit: microseconds
# expr min lq mean median uq max neval
# subsetting(big_dt) 430 759 1672 1309 1563 82934 5000
# usingSD(big_dt) 547 951 1872 1461 1797 60357 5000
# usingWith(big_dt) 278 496 1331 1112 1304 62656 5000
# usingDotDot(big_dt) 289 483 1392 1117 1344 55878 5000
# usingPatterns(big_dt) 596 1019 1984 1518 1913 120331 5000
Since data.table v1.12.0 (Jan 2019) you can do:
mydt[, .SD, .SDcols = patterns("bar|baz")]
From the official documentation ?data.table, on the .SDcols argument:
[...] you can filter columns to include in .SD based on their names according to regular
expressions via .SDcols=patterns(regex1, regex2, ...). The included
columns will be the intersection of the columns identified by each
pattern; pattern unions can easily be specified with | in a regex. [...] You
can also invert a pattern as usual with .SDcols = !patterns(...).
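A small sketch based on that documentation, using the table from the question:
# intersection of two patterns: only columns matching both "ba" and "r" (here: bar)
mydt[, .SD, .SDcols = patterns("ba", "r")]
# inverted pattern: every column except those matching "foo"
mydt[, .SD, .SDcols = !patterns("foo")]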
There is also a subset method for "data.table", so you can always use something like the following:
subset(mydt, select = grep("bar|baz", names(mydt)))
# bar baz
# 1: 2 3
# 2: 3 4
It turns out that creating a startswith type of function for "data.table" is not very straightforward.
I suggest this one-liner code for readability and performance.
mydt[, names(mydt) %like% "bar|baz", with = FALSE]
Following @Janosdivenji's answer: see usingLike on the last row.
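usingLike is not defined in the quoted benchmark code; presumably it wraps the one-liner above, along the lines of:
usingLike <- function(dt) {
  dt[, names(dt) %like% "bar|baz", with = FALSE]
}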
Unit: microseconds
expr min lq mean median uq max neval
subsetting(big_dt) 370.582 977.2760 1194.875 1016.4340 1096.9285 25750.94 5000
usingSD(big_dt) 554.330 1084.8530 1352.039 1133.4575 1226.9060 189905.39 5000
usingWith(big_dt) 238.481 832.7505 1017.051 866.6515 927.8460 22717.83 5000
usingDotDot(big_dt) 256.005 844.8770 1101.543 878.9935 936.6040 181855.43 5000
usingPatterns(big_dt) 569.787 1128.0970 1411.510 1178.2895 1282.2265 177415.23 5000
usingLike(big_dt) 262.868 852.5805 1059.466 887.3455 948.6665 23971.70 5000
Related
I have big data like this:
> Data[1:7,1]
[1] mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
[2] mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
[3] mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
[4] mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
[5] mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
[6] mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
[7] mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
What I want to do is, in every row, select the name after the word mature= and also the word after Gene=, and then paste them together with
paste(a,b, sep="-")
For example, the expected output for the first two rows would be:
hsa-miR-5087-OR4F5
hsa-miR-26a-1-3p-OR4F9
So the final implementation is like this:
for (i in 1:nrow(Data)) {
  Data[i, 3] <- sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[i, 1])
  Name <- strsplit(as.vector(Data[i, 2]), "\\|")[[1]][2]
  Data[i, 4] <- as.numeric(sub("pvalue=", "", Name))
  print(i)
}
which works well, but it's very slow. Data is very big, with 200,000,000 rows, and this implementation is far too slow for that. How can I speed it up?
If you can guarantee that the format is exactly as you specified, then a regular expression can capture (denoted by the brackets below) everything from the equals sign upto the pipe symbol, and from the Gene= to the end, and paste them together with a minus sign:
sub("mature=([^|]*).*Gene=(.*)", "\\1-\\2", Data[,1])
Another option is to use read.table with = as the separator and then paste the two columns:
res <- read.table(text = txt, sep = '=')       ## txt is the character vector of lines, e.g. as.character(Data[, 1])
paste(sub('[|].*', '', res$V2),                ## drop everything from the first | onwards
      sub('^ +| +$', '', res$V4), sep = '-')   ## trim surrounding spaces
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5"
[5] "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5" "hsa-miR-650-OR4F5"
The simple sub solution already given looks quite nice but just in case here are some other approaches:
1) read.pattern
Using read.pattern in the gsubfn package we can parse the data into a data.frame. This intermediate form, DF, can then be manipulated in many ways. In this case we use paste in essentially the same way as in the question:
library(gsubfn)
DF <- read.pattern(text = Data[, 1], pattern = "(\\w+)=([^|]*)")
paste(DF$V2, DF$V6, sep = "-")
giving:
[1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5"
[4] "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
[7] "hsa-miR-650-OR4F5"
The intermediate data frame, DF, that was produced looks like this:
> DF
V1 V2 V3 V4 V5 V6
1 mature hsa-miR-5087 mir_Family - Gene OR4F5
2 mature hsa-miR-26a-1-3p mir_Family mir-26 Gene OR4F9
3 mature hsa-miR-448 mir_Family mir-448 Gene OR4F5
4 mature hsa-miR-659-3p mir_Family - Gene OR4F5
5 mature hsa-miR-5197-3p mir_Family - Gene OR4F5
6 mature hsa-miR-5093 mir_Family - Gene OR4F5
7 mature hsa-miR-650 mir_Family mir-650 Gene OR4F5
Here is a visualization of the regular expression we used:
(\w+)=([^|]*)
Debuggex Demo
1a) names
We could make DF look nicer by reading the three columns of data and the three names separately. This also improves the paste statement:
DF <- read.pattern(text = Data[, 1], pattern = "=([^|]*)")
names(DF) <- unlist(read.pattern(text = Data[1,1], pattern = "(\\w+)=", as.is = TRUE))
paste(DF$mature, DF$Gene, sep = "-") # same answer as above
The DF produced in this section looks like this. It has 3 columns instead of 6; the names were read separately and used to determine the appropriate column names:
> DF
mature mir_Family Gene
1 hsa-miR-5087 - OR4F5
2 hsa-miR-26a-1-3p mir-26 OR4F9
3 hsa-miR-448 mir-448 OR4F5
4 hsa-miR-659-3p - OR4F5
5 hsa-miR-5197-3p - OR4F5
6 hsa-miR-5093 - OR4F5
7 hsa-miR-650 mir-650 OR4F5
2) strapplyc
Another approach using the same package. This extracts the fields that come after an = and do not contain a |, producing a list. We then sapply over that list, pasting the first and third fields together:
sapply(strapplyc(Data[, 1], "=([^|]*)"), function(x) paste(x[1], x[3], sep = "-"))
giving the same result.
Here is a visualization of the regular expression used:
=([^|]*)
Debuggex Demo
Here is one approach:
Data <- readLines(n = 7)
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5
mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
mature=hsa-miR-448|mir_Family=mir-448|Gene=OR4F5
mature=hsa-miR-659-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5197-3p|mir_Family=-|Gene=OR4F5
mature=hsa-miR-5093|mir_Family=-|Gene=OR4F5
mature=hsa-miR-650|mir_Family=mir-650|Gene=OR4F5
df <- read.table(sep = "|", text = Data, stringsAsFactors = FALSE)
l <- lapply(df, strsplit, "=")
trim <- function(x) gsub("^\\s*|\\s*$", "", x)
paste(trim(sapply(l[[1]], "[", 2)), trim(sapply(l[[3]], "[", 2)), sep = "-")
# [1] "hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9" "hsa-miR-448-OR4F5" "hsa-miR-659-3p-OR4F5" "hsa-miR-5197-3p-OR4F5" "hsa-miR-5093-OR4F5"
# [7] "hsa-miR-650-OR4F5"
Maybe not the most elegant, but you can try:
sapply(Data[, 1], function(x) {
  parts <- strsplit(x, "\\|")[[1]]
  y <- paste(gsub("(mature=)|(Gene=)", "", parts[grepl("mature|Gene", parts)]), collapse = "-")
  return(y)
})
Example
Data<-data.frame(col1=c("mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5","mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"),col2=1:2,stringsAsFactors=F)
> Data[,1]
[1] "mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5" "mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9"
> sapply(Data[,1],function(x){
+ parts<-strsplit(x,"\\|")[[1]]
+ y<-paste(gsub("(mature=)|(Gene=)","",parts[grepl("mature|Gene",parts)]),collapse="-")
+ return(y)
+ })
mature=hsa-miR-5087|mir_Family=-|Gene=OR4F5 mature=hsa-miR-26a-1-3p|mir_Family=mir-26|Gene=OR4F9
"hsa-miR-5087-OR4F5" "hsa-miR-26a-1-3p-OR4F9"
I have a file with ~ 40 million rows that I need to split based on the first comma delimiter.
The following, using the stringr function str_split_fixed, works well but is very slow.
library(data.table)
library(stringr)
df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25,1000, replace = T)], 40))
df1$combCol1 <- paste(df1$id, ',',df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')
st1 <- str_split_fixed(df1$combCol2, ',', 2)
Any suggestions for a faster way to do this?
Update
The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:
stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
Original answer (with updated benchmarks)
If you are comfortable with the "stringr" syntax and don't want to veer too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead:
library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
# user system elapsed
# 3.25 0.00 3.25
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
# user system elapsed
# 0.04 0.00 0.05
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
# user system elapsed
# 0.01 0.00 0.01
Most of the "stringr" functions have "stringi" parallels, but as can be seen from this example, the "stringi" output required one extra step of binding the data to create the output as a matrix instead of as a list.
Here's how it compares with @RichardScriven's suggestion in the comments:
fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2),
                            invert = TRUE))
}
library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1a() 42.72647 46.35848 59.56948 51.94796 69.29920 98.46330 10
# fun1b() 17.55183 18.59337 20.09049 18.84907 22.09419 26.85343 10
# fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912 10
I have a CSV file like
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
Desired output to a CSV file, with the first row as the headers:
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
res <- gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*', '\\1,\\2:\\3,\\4',
            xx$Market)
How do I modify the above regular expression to get the result in R?
I'm new to R; any help is appreciated.
library(stringr)
xx.to.split <- with(xx, setNames(gsub("Metro", "", as.character(CampaignName)), Market))
do.call(rbind, str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?"))[, -1]
Produces:
[,1] [,2]
Wells Fargo "Gary" "IN"
Wells Fargo "Chicago" "IL"
EMC "Los Angeles" "CA"
EMC "Boston" "MA"
Apple "Cupertino" "CA"
This should work even if you have a different number of campaign names in each market. Unfortunately I think base options are annoying to implement because, frustratingly, there isn't a gregexec, although I'd be curious if someone comes up with something comparably compact in base.
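To get from that matrix to the desired CSV layout, one possible sketch (building on xx and xx.to.split above; the file name output.csv is only illustrative):
matches <- str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?")
out <- data.frame(
  Market   = rep(xx$Market, sapply(matches, nrow)),      # repeat Market once per extracted city
  City     = unlist(lapply(matches, function(m) m[, 2])),
  State    = unlist(lapply(matches, function(m) m[, 3])),
  Identity = rep(xx$Identity, sapply(matches, nrow))
)
write.csv(out, "output.csv", row.names = FALSE, quote = FALSE)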
Here is a solution using base R. Split the CampaignName column on the string Metro adding sequential numbers as names. stack turns it into a data frame with columns ind and values which we massage into DF1. Merge that with xx by the sequence numbers of DF1 and the row numbers of xx. Move Market to the front of DF2 and remove ind and CampaignName. Finally write it out.
xx <- read.csv("Campaign.csv", as.is = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- seq_along(s)
ss <- stack(s)
DF1 <- with(ss, data.frame(ind,
                           City = sub(" ..$", "", values),
                           State = sub(".* ", "", values)))
DF2 <- merge(DF1, xx, by.x = "ind", by.y = 0)
DF <- DF2[ c("Market", setdiff(names(DF2), c("ind", "Market", "CampaignName"))) ]
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
REVISED to handle extra columns after poster modified the question to include such. Minor improvements.
I have the following data.table:
id fShort
1 432-12 1245
2 3242-12 453543
3 324-32 45543
4 322-34 45343
5 2324-34 13543
DT <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
and the following list:
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
I would like to create a new column "fComplete" that contains the complete filename from the list. For this, the values of the "id" column need to be matched against the filename list. If a filename starts with the "id" string, the complete filename should be returned. I use the following expression
t <- grep("432-12","432-124343.png",value=T)
which returns the correct filename.
This is how the final table should look:
id fShort fComplete
1 432-12 1245 432-124343.png
2 3242-12 453543 3242-124342345.png
3 324-32 45543 NA
4 322-34 45343 NA
5 2324-34 13543 NA
DT2 <- data.table(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fshort=c("1245", "453543", "45543", "45343", "13543"),
fComplete = c("432-124343.png", "3242-124342345.png", NA, NA, NA))
I tried using apply and data.table approaches but I always get warnings like
argument 'pattern' has length > 1 and only the first element will be used
What is a simple approach to accomplish this?
Here's a data.table solution:
DT[, fComplete := lapply(id, function(x) {
  m <- grep(x, filenames, value = TRUE)
  if (!length(m)) NA else m
})]
id fShort fComplete
1: 432-12 1245 432-124343.png
2: 3242-12 453543 3242-124342345.png
3: 324-32 45543 NA
4: 322-34 45343 NA
5: 2324-34 13543 NA
In my experience with similar functions, the regex functions sometimes return a list, so you have to account for that in the apply; I usually work through an example manually first.
Also, in my experience, apply on its own will not always return something that drops straight into a data.frame; sometimes I have had to use lapply, and/or unlist and data.frame, to reshape it.
Here is an answer. I am not familiar with data.tables, and I was having issues with the filenames being in a list, but with some transformations this works. I worked it out by inspecting what apply was outputting and adding the [1] to get the piece I needed.
DT <- data.frame(
id=c("432-12", "3242-12", "324-32", "322-34", "2324-34"),
fShort=c("1245", "453543", "45543", "45343", "13543"))
filenames <- list("3242-124342345.png", "432-124343.png", "135-13434.jpeg")
filenames1 <- unlist(filenames)
x<-apply(DT[1],1,function(x) grep(x,filenames1)[1])
DT$filename <- filenames1[x]
Let's imagine you have a string:
strLine <- "The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)"
Is there a function that strips out the numbers into an array/vector producing the following required solution:
result <- c(0, 3000, -500, 0, 2.25, -1200)?
i.e.
result[3] = -500
Notice the numbers are presented in accounting form, so negative numbers appear between parentheses. Also, you can assume that only numbers appear to the right of the first occurrence of a number. I am not that good with regexes, so I would appreciate help if one is required. Also, I don't want to assume the string is always the same, so I am looking to strip out all words (and any special characters) before the location of the first number.
library(stringr)
x <- str_extract_all(strLine,"\\(?[0-9,.]+\\)?")[[1]]
> x
[1] "0" "3,000" "(500)" "0" "2.25" "(1,200)"
Change the parens to negatives:
x <- gsub("\\((.+)\\)","-\\1",x)
x
[1] "0" "3,000" "-500" "0" "2.25" "-1,200"
And then as.numeric() or taRifx::destring to finish up (the next version of destring will support negatives by default so the keep option won't be necessary):
library(taRifx)
destring( x, keep="0-9.-")
[1] 0 3000 -500 0 2.25 -1200
OR:
as.numeric(gsub(",","",x))
[1] 0 3000 -500 0 2.25 -1200
Here's the base R way, for the sake of completeness...
x <- unlist(regmatches(strLine, gregexpr('\\(?[0-9,.]+', strLine)))
x <- as.numeric(gsub('\\(', '-', gsub(',', '', x)))
[1] 0.00 3000.00 -500.00 0.00 2.25 -1200.00
What worked perfectly for me when working on single strings in a data frame (one string per row in the same column) was the following:
library(taRifx)
DataFrame$Numbers<-as.character(destring(DataFrame$Strings, keep="0-9.-"))
The results go into a new column of the same data frame.
Since this came up in another question, this is an uncrutched stringi solution (vs the stringr crutch):
as.numeric(
stringi::stri_replace_first_fixed(
stringi::stri_replace_all_regex(
unlist(stringi::stri_match_all_regex(
"The transactions (on your account) were as follows: 0 3,000 (500) 0 2.25 (1,200)",
"\\(?[0-9,.]+\\)?"
)), "\\)$|,", ""
),
"(", "-"
)
)