R: Fast string split on first delimiter occurrence - regex

I have a file with ~ 40 million rows that I need to split based on the first comma delimiter.
The following, using the stringr function str_split_fixed, gives the correct result but is very slow.
library(data.table)
library(stringr)
df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25,1000, replace = T)], 40))
df1$combCol1 <- paste(df1$id, ',',df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')
st1 <- str_split_fixed(df1$combCol2, ',', 2)
Any suggestions for a faster way to do this?

Update
The stri_split_fixed function in more recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. Thus, the updated solution would be:
stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
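If you then want the two pieces back in the data as separate columns, a minimal sketch (the new column names first_part and rest are purely illustrative):
library(stringi)
parts <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
df1$first_part <- parts[, 1]  # text before the first comma
df1$rest <- parts[, 2]        # everything after the first comma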
Original answer (with updated benchmarks)
If you are comfortable with the "stringr" syntax and don't want to veer too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead:
library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
# user system elapsed
# 3.25 0.00 3.25
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
# user system elapsed
# 0.04 0.00 0.05
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
# user system elapsed
# 0.01 0.00 0.01
Most of the "stringr" functions have "stringi" parallels, but as can be seen from this example, the "stringi" output required one extra step of binding the list elements row-wise to get a matrix instead of a list.
Here's how it compares with @RichardScriven's suggestion in the comments:
fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2),
invert = TRUE))
}
library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1a() 42.72647 46.35848 59.56948 51.94796 69.29920 98.46330 10
# fun1b() 17.55183 18.59337 20.09049 18.84907 22.09419 26.85343 10
# fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912 10

how do you parse a sysmon file to extract certain information using R?

I am trying to read a bunch of files of this type using R, parse out the information, and put the data in a data-frame-like format.
This is the contents of one such file:
last_run current_run seconds
------------------------------- ------------------------------- -----------
Jul 4 2016 7:17AM Jul 4 2016 7:21AM 226
Engine Utilization (Tick %) User Busy System Busy I/O Busy Idle
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Engine 0 5.0 % 0.4 % 22.4 % 72.1 %
Engine 1 3.9 % 0.5 % 22.8 % 72.8 %
Engine 2 5.6 % 0.3 % 22.5 % 71.6 %
Engine 3 5.1 % 0.4 % 22.7 % 71.8 %
------------------------- ------------ ------------ ---------- ----------
Pool Summary Total 336.1 % 25.6 % 1834.6 % 5803.8 %
Average 4.2 % 0.3 % 22.9 % 72.5 %
------------------------- ------------ ------------ ---------- ----------
Server Summary Total 336.1 % 25.6 % 1834.6 % 5803.8 %
Average 4.2 % 0.3 % 22.9 % 72.5 %
Transaction Profile
-------------------
Transaction Summary per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Committed Xacts 137.3 n/a 41198 n/a
Average Runnable Tasks 1 min 5 min 15 min % of total
------------------------- ------------ ------------ ---------- ----------
ThreadPool : syb_default_pool
Global Queue 0.0 0.0 0.0 0.0 %
Engine 0 0.0 0.1 0.1 0.6 %
Engine 1 0.0 0.0 0.0 0.0 %
Engine 2 0.2 0.1 0.1 2.6 %
------------------------- ------------ ------------ ----------
Pool Summary Total 7.2 5.9 6.1
Average 0.1 0.1 0.1
------------------------- ------------ ------------ ----------
Server Summary Total 7.2 5.9 6.1
Average 0.1 0.1 0.1
Device Activity Detail
----------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_125
datadev_125 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_126
datadev_126 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 n/a
------------------------- ------------ ------------ ---------- ----------
Total I/Os 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/dev/vx/rdsk/sybaserdatadg/datadev_127
datadev_127 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
APF 0.0 0.0 5 0.4 %
Non-APF 0.0 0.0 1 0.1 %
Writes 3.8 0.0 1128 99.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os 3.8 0.0 1134 0.1 %
Mirror Semaphore Granted 3.8 0.0 1134 100.0 %
Mirror Semaphore Waited 0.0 0.0 0 0.0 %
-----------------------------------------------------------------------------
Device:
/sybaser/database/sybaseR/dev/sybaseR.datadev_000
GPS_datadev_000 per sec per xact count % of total
------------------------- ------------ ------------ ---------- ----------
Reads
APF 7.9 0.0 2372 55.9 %
Non-APF 5.5 0.0 1635 38.6 %
Writes 0.8 0.0 233 5.5 %
------------------------- ------------ ------------ ---------- ----------
Total I/Os 14.1 0.0 4240 0.3 %
Mirror Semaphore Granted 14.1 0.0 4239 100.0 %
Mirror Semaphore Waited 0.0 0.0 2 0.0 %
I need to capture "Jul 4 2016 7:21AM" as Date,
from the "Engine Utilization (Tick %)" section the Server Summary -> Average value ("4.2 %") as Cpu,
and from the "Transaction Profile" section the Transaction Summary "count" entry as Count.
So my data frame should look something like this:
Date Cpu Count
Jul 4 2016 7:21AM 4.2 41198
Can somebody help me parse this file to get this output?
I have tried something like this:
read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])
to get this line:
Average 4.2 % 0.3 % 22.9 % 72.5 %
But I want to extract only the Average line right after
Engine Utilization (Tick %), since there could be many lines that start with Average. The Average line that shows up right after Engine Utilization (Tick %) is the one I want.
How do I put that in this line to extract this information from this file:
read.table(text=readLines("file.txt")[count.fields("file.txt", blank.lines.skip=FALSE) == 9])
Can I use grep in this read.table line to search for certain characters?
%%%% Shot 1 -- got something working
extract <- function(filenam="file.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
data.frame(Date=date_current_run, Cpu=Cpu, Count=count, stringsAsFactors=FALSE)
}
print(extract("file.txt"))
##file.list <- dir("./")
file.list <- rep("file.txt",3)
merged <- do.call("rbind", lapply(file.list, extract))
print(merged)
file.list <- rep("file.txt",2000)
print(system.time(merged <- do.call("rbind", lapply(file.list, extract))))
## runs in about 2.5 secs on my laptop
%%% Shot 2: 1st attempt to extract a (potentially variable) number of device columns
extractv2 <- function(filenam="file2.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
## Total I/Os
## 1. Each line "Device:" is assumed to be the header of a block of lines
## containing info about a single device (there are 4 such blocks
## in your example);
## 2. each block is assumed to contain one or more lines matching
## "Total I/Os";
## 3. the relevant count data is assumed to be contained in the last
## of such lines (at column 4), for each block.
## Approach: loop on the line numbers of those lines matching "Device:"
## to get: A. counts; B. device names
ii_block_dev <- grep("Device:", txt, fixed=TRUE)
ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
nblocks <- length(ii_block_dev)
## A. get counts for each device
## for each block, select *last* line matching "Total I/Os"
ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
## select matching lines to "Total I/Os" within each block
IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
ii_lines_IOs < ii_block_dev_aux[block+1]
]
tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
})
lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
Filter(function(v) v!="", strsplit(strng," ")[[1]])
})
IOs_counts <- sapply(lines_IOs, function(v) v[5])
## B. get device names:
## assumed to be on lines following each "Device:" match
ii_devices <- 1 + ii_block_dev
device_names <- sapply(ii_devices, function(ii){
Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
})
## Create a data.frame with "device_names" as column names and "IOs_counts" as
## the values of a single row.
## Sorting the device names by order() will help produce the same column names
## if different sysmon files list the devices in different order
ord <- order(device_names)
devices <- as.data.frame(structure(as.list(IOs_counts[ord]), names=device_names[ord]),
check.names=FALSE) ## Prevent R from messing with our device names
data.frame(stringsAsFactors=FALSE, check.names=FALSE,
Date=date_current_run, Cpu=Cpu, Count=count, devices)
}
print(extractv2("file2.txt"))
## WATCH OUT:
## merging will ONLY work if all devices have the same names across sysmon files!!
file.list <- rep("file2.txt",3)
merged <- do.call("rbind", lapply(file.list, extractv2))
print(merged)
%%%%%%% Shot 3: extract two tables, one with a single row, and a second with a variable number of rows (depending on which devices are listed in each sysmon file).
extractv3 <- function(filenam="file2.txt"){
txt <- readLines(filenam)
## date of current run:
## assumed to be on 2nd line following the first line matching "current_run"
ii <- 2 + grep("current_run",txt, fixed=TRUE)[1]
line_current_run <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
date_current_run <- paste(line_current_run[5:8], collapse=" ")
## Cpu:
## assumed to be on line following the first line matching "Server Summary"
## which comes after the first line matching "Engine Utilization ..."
jj <- grep("Engine Utilization (Tick %)", txt, fixed=TRUE)[1]
ii <- grep("Server Summary",txt, fixed=TRUE)
ii <- 1 + min(ii[ii>jj])
line_Cpu <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
Cpu <- line_Cpu[2]
## Count:
## assumed to be on 2nd line following the first line matching "Transaction Summary"
ii <- 2 + grep("Transaction Summary",txt, fixed=TRUE)[1]
line_count <- Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
count <- line_count[5]
## first part of output: fixed three-column structure
fixed <- data.frame(stringsAsFactors=FALSE,
Date=date_current_run, Cpu=Cpu, Count=count)
## Total I/Os
## 1. Each line "Device:" is assumed to be the header of a block of lines
## containing info about a single device (there are 4 such blocks
## in your example);
## 2. each block is assumed to contain one or more lines matching
## "Total I/Os";
## 3. the relevant count data is assumed to be contained in the last
## of such lines (at column 4), for each block.
## Approach: loop on the line numbers of those lines matching "Device:"
## to get: A. counts; B. device names
ii_block_dev <- grep("Device:", txt, fixed=TRUE)
if(length(ii_block_dev)==0){
variable <- data.frame(stringsAsFactors=FALSE,
date_current_run=date_current_run,
device_names=NA, IOs_counts=NA)
}else{
ii_lines_IOs <- grep("Total I/Os", txt, fixed=TRUE)
nblocks <- length(ii_block_dev)
if(length(ii_lines_IOs)==0){
warning(sprintf("WEIRD datapoint at date %s: I have %d devices but 0 I/O lines??",
date_current_run, nblocks))
##stop()
}
## A. get counts for each device
## for each block, select *last* line matching "Total I/Os"
ii_block_dev_aux <- c(ii_block_dev, Inf) ## just a hack to get a clean code
ii_lines_IOs_dev <- sapply(1:nblocks, function(block){
## select matching lines to "Total I/Os" within each block
IOs_per_block <- ii_lines_IOs[ ii_lines_IOs > ii_block_dev_aux[block ] &
ii_lines_IOs < ii_block_dev_aux[block+1]
]
tail(IOs_per_block, 1) ## get the last line of each block (if more than one match)
})
lines_IOs <- lapply(txt[ii_lines_IOs_dev], function(strng){
Filter(function(v) v!="", strsplit(strng," ")[[1]])
})
IOs_counts <- sapply(lines_IOs, function(v) v[5])
## B. get device names:
## assumed to be on lines following each "Device:" match
ii_devices <- 1 + ii_block_dev
device_names <- sapply(ii_devices, function(ii){
Filter(function(v) v!="", strsplit(txt[ii]," ")[[1]])
})
## Create a data.frame with three columns: date, device, counts
variable <- data.frame(stringsAsFactors=FALSE,
date_current_run=rep(date_current_run, length(IOs_counts)),
device_names=device_names, IOs_counts=IOs_counts)
}
list(fixed=fixed, variable=variable)
}
print(extractv3("file2.txt"))
file.list <- c("file.txt","file2.txt","file3.txt")
res <- lapply(file.list, extractv3)
fixed.merged <- do.call("rbind", lapply(res, function(r) r$fixed))
print(fixed.merged)
variable.merged <- do.call("rbind", lapply(res, function(r) r$variable))
print(variable.merged)
Manipulating text files can sometimes be easier using dedicated programs. E.g. gawk is specifically designed for finding patterns in text files and outputting data from them. We can use a short gawk script to get the required data to load into R. Note that each line of the script consists of a pattern to look for, followed by an action to take enclosed in {}. NR is a counter that counts the number of lines read so far.
BEGIN {OFS = ""; ORS = ""}
/current_run/ {dat_line = NR+2; cpu_done = 0}
/Server Summary/ {cpu_line = NR+1}
/Transaction Summary/ {cnt_line = NR+2}
NR == dat_line {print "'",$5," ",$6," ",$7," ",$8,"' "}
NR == cpu_line && cpu_done==0 {print $2," "; cpu_done = 1}
NR == cnt_line {print $5,"\n"}
Save this script with the name "ext.awk", then extract all the data files into an R data frame (assuming they are all located in one folder and have the extension .txt) with
df <- read.table(text=system("gawk -f ext.awk *.txt", T), col.names = c("Date","Cpu","Count"))
NB, gawk comes pre-installed on most Linux distributions. On Windows you may need to install it from http://gnuwin32.sourceforge.net/packages/gawk.htm
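If you first want to check what the script produces for a single file, one way (assuming gawk is on your PATH and the script is saved as ext.awk in the working directory) is to capture its output directly in R:
system("gawk -f ext.awk file.txt", intern = TRUE)
# e.g. "'Jul 4 2016 7:21AM' 4.2 41198"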
For reading the files:
Here I am assuming CSV as the file type. For other formats, see
http://www.r-tutor.com/r-introduction/data-frame/data-import
utilization <- read.csv(file="", header=TRUE)
serverSummary <- read.csv(file="", header=TRUE)
transcProfile <- read.csv(file="", header=TRUE)
merge only accepts two arguments, so merge the tables pairwise:
data <- merge(utilization, serverSummary)
dataframe <- merge(data, transcProfile)
Now you will have all the columns in dataframe:
dataframe
Extract the columns as required. The subset() function is the easiest way to select variables and observations:
subset(dataframe, select=c("last_run", "Average", "Transaction Profile"))
Now you can write it to CSV or any other file type:
write.csv(dataframe, file = "MyData.csv")
For merging all the files together:
multmerge <- function(mypath){
  filenames <- list.files(path=mypath, full.names=TRUE)
  datalist <- lapply(filenames, function(x){read.csv(file=x, header=TRUE)})
  Reduce(function(x,y) {merge(x,y)}, datalist)
}
After running the code to define the function, you are all set to use it. The function takes a path. This path should be the name of a folder that contains all of the files you would like to read and merge together and only those files you would like to merge. With this in mind, I have two tips:
Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be “C://R//mergeme“) and save all of the files you would like to merge in that folder.
In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.
Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you would like to read and merge them. To use my function, you use the following syntax:
mymergeddata <- multmerge("C://R//mergeme")
After running this command, you have a fully merged data frame with all of your variables matched to each other.
Now you can subset the dataframe as per required columns.
Use readLines or stringi::stri_read_lines to read the contents of the file as a character vector. The latter is typically faster, but not as mature, and occasionally breaks on unusual content.
lines <- readLines("the file name")
For fast regular expression matching, stringi is typically the best choice. rebus.datetimes allows you to generate a regular expression from a strptime date format string.
Finding the current run date
The line on which current_run appears is found with:
library(stringi)
library(rebus.datetimes)
i_current_run <- which(stri_detect_fixed(lines, "current_run"))
To extract the dates, this code only looks at the 2nd line after the one where current run is found, but the code is vectorizable, so you can easily look at all the lines if you have files where that assumption doesn't hold.
date_format <- "%b%t%d%t%Y%t%H:%M%p"
rx_date <- rebus.datetimes::datetime(date_format, io = "input")
extracted_dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)
current_run_date <- strptime(
extracted_dates[[1]][2], date_format, tz = "UTC"
)
## [1] "2016-07-04 07:21:00 UTC"
Finding the % user busy
The "Engine Utilization" section is found via
i_engine_util <- which(
stri_detect_fixed(lines, "Engine Utilization (Tick %)")
)
We want the first instance of "Server Summary" that comes after this line.
n_lines <- length(lines)  # total number of lines in the file
i_server_summary <- i_engine_util +
  min(which(
    stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")
  ))
Use a regular expression to extract the number from the next line.
user_busy <- as.numeric(
stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9])")
)
## [1] 4.2
Finding the count of committed xacts
The "Committed Xacts" line is
i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts"))
The count value is a set of digits surrounded by space.
xacts_count <- as.integer(
stri_extract_all_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
)
## [1] 41198
Combining the results
data.frame(
Date = current_run_date,
CPU = user_busy,
Count = xacts_count
)
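If you need to process many sysmon files, the steps above can be wrapped into a single function and applied with lapply, following the same do.call(rbind, ...) pattern as the earlier answers. This is only a rough sketch under the same assumptions as above (the name parse_sysmon and the commented file.list are illustrative):
library(stringi)
library(rebus.datetimes)
parse_sysmon <- function(filenam) {
  lines <- readLines(filenam)
  n_lines <- length(lines)
  ## current run date: 2nd line after the first "current_run" match
  i_current_run <- which(stri_detect_fixed(lines, "current_run"))[1]
  date_format <- "%b%t%d%t%Y%t%H:%M%p"
  rx_date <- rebus.datetimes::datetime(date_format, io = "input")
  dates <- stri_extract_all_regex(lines[i_current_run + 2], rx_date)[[1]]
  current_run_date <- strptime(dates[2], date_format, tz = "UTC")
  ## % user busy: line after the first "Server Summary" following "Engine Utilization"
  i_engine_util <- which(stri_detect_fixed(lines, "Engine Utilization (Tick %)"))[1]
  i_server_summary <- i_engine_util +
    min(which(stri_detect_fixed(lines[(i_engine_util + 1):n_lines], "Server Summary")))
  user_busy <- as.numeric(
    stri_extract_first_regex(lines[i_server_summary + 1], "[0-9]+(?:\\.[0-9])")
  )
  ## committed xacts: digits surrounded by spaces on the "Committed Xacts" line
  i_comm_xacts <- which(stri_detect_fixed(lines, "Committed Xacts"))[1]
  xacts_count <- as.integer(
    stri_extract_first_regex(lines[i_comm_xacts], "(?<= )[0-9]+(?= )")
  )
  data.frame(Date = current_run_date, CPU = user_busy, Count = xacts_count)
}
## file.list <- dir(".", pattern = "\\.txt$")
## do.call(rbind, lapply(file.list, parse_sysmon))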

Splitting string columns FAST in R

I have a data frame with 107 columns and 745000 rows (much bigger than in my example).
The thing is that I have character columns which I want to split, because they seem to contain a type-like ending at the end of each sequence.
I want to separate these type-ending parts into new columns.
I have made my own solution, but it seems far too slow for iterating through all the 745000 rows 53 times.
So I embed my solution in the following code, with some arbitrary data:
set.seed(1)
code_1 <- paste0(round(runif(5000, 100000, 999999)), "_", round(runif(1000, 1, 15)))
code_2 <- sample(c(paste0(round(runif(10, 100000, 999999)), "_", round(runif(10, 1, 15))), NA), 5000, replace = TRUE)
code_3 <- sample(c(paste0(round(runif(3, 100000, 999999)), "_", round(runif(3, 1, 15))), NA), 5000, replace = TRUE)
code_4 <- sample(c(paste0(round(runif(1, 100000, 999999)), "_", round(runif(1, 1, 15))), NA), 5000, replace = TRUE)
code_type_1 <- rep(NA, 5000)
code_type_2 <- rep(NA, 5000)
code_type_3 <- rep(NA, 5000)
code_type_4 <- rep(NA, 5000)
df <- data.frame(cbind(code_1,
code_2,
code_3,
code_4,
code_type_1,
code_type_2,
code_type_3,
code_type_4),
stringsAsFactors = FALSE)
df_new <- data.frame(code_1 = character(),
code_2 = character(),
code_3 = character(),
code_4 = character(),
code_type_1 = character(),
code_type_2 = character(),
code_type_3 = character(),
code_type_4 = character(),
stringsAsFactors = FALSE)
for (i in 1:4) {
i_t <- i + 4
temp <- strsplit(df[, c(i)], "[_]")
for (j in 1:nrow(df)) {
df_new[c(j), c(i)] <- unlist(temp[j])[1]
df_new[c(j), c(i_t)] <- ifelse(is.na(unlist(temp[j])[1]), NA, unlist(temp[j])[2])
}
print(i)
}
for (i in 1:8) {
df_new[, c(i)] <- factor(df_new[, c(i)])
}
Does anyone have ideas on how to speed things up here?
First we pre-allocate the results data.frame to the desired final length. This is very important; see The R Inferno, Circle 2. Then we vectorize the inner loop. We also use fixed = TRUE and avoid the regex in strsplit.
system.time({
df_new1 <- data.frame(code_1 = character(nrow(df)),
code_2 = character(nrow(df)),
code_3 = character(nrow(df)),
code_4 = character(nrow(df)),
code_type_1 = character(nrow(df)),
code_type_2 = character(nrow(df)),
code_type_3 = character(nrow(df)),
code_type_4 = character(nrow(df)),
stringsAsFactors = FALSE)
for (i in 1:4) {
i_t <- i + 4
temp <- do.call(rbind, strsplit(df[, c(i)], "_", fixed = TRUE))
df_new1[, i] <- temp[,1]
df_new1[, i_t] <- ifelse(is.na(temp[,1]), NA, temp[,2])
}
df_new1[] <- lapply(df_new1, factor)
})
# user system elapsed
# 0.029 0.000 0.029
all.equal(df_new, df_new1)
#[1] TRUE
Of course, there are ways to make this even faster, but this is close to your original approach and should be sufficient.
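If you need still more speed, here is a minimal sketch of a fully vectorised variant using stri_split_fixed() from the stringi package (simplify = NA pads missing pieces with NA, so no ifelse is needed; df_new2 is just an illustrative name, and the column positions follow the example data with codes in columns 1-4 and types in 5-8):
library(stringi)
df_new2 <- df
for (i in 1:4) {
  parts <- stri_split_fixed(df[[i]], "_", n = 2, simplify = NA)
  df_new2[[i]] <- parts[, 1]       # code part before the underscore
  df_new2[[i + 4]] <- parts[, 2]   # type part after the underscore (NA if absent)
}
df_new2[] <- lapply(df_new2, factor)
This should give the same columns as df_new1 above.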
Here's another way, using gsub inside a custom function in combination with purrr::dmap() - which is equivalent to lapply, but outputs a data.frame instead of a list.
library(purrr)
# Define function which gets rid of everything after and including "_"
replace01 <- function(df, ptrn = "_.*")
dmap(df[,1:4], gsub, pattern = ptrn, replacement = "")
# Because "pattern" is argument we can change it to get 2nd part, then cbind()
test <- cbind(replace01(df),
replace01(df, ptrn = ".*_"))
Note that the output here is character columns; you can always convert them to factor if you like.
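For example, assuming the combined result from the cbind() call above is stored in test:
test[] <- lapply(test, factor)  # convert every column to factor in place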
Another possibility, using stri_split_fixed() from the stringi package:
library(stringi)
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify=TRUE)
y <- c(x[,1], x[,2])
y[y==""] <- NA
y
})), colnames(df)) -> df_new
or
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify=TRUE)
c(x[,1], x[,2])
})), colnames(df)) -> df_new
df_new[df_new==""] <- NA
df_new
which is marginally faster (na_after is the second variant, which replaces empty strings with NA after binding; na_inner does it inside the loop):
Unit: milliseconds
expr min lq mean median uq max neval cld
na_after 669.8357 718.1301 724.8803 723.5521 732.9998 790.1405 10 a
na_inner 719.3362 738.1569 766.4267 762.1594 791.6198 825.0269 10 b

R: convert text duration ("..d ..h ..m ..s") into seconds

Trying to convert the following durations into seconds
x <- "1005d 16h 09m 57s"
x1 <- "16h 09m 57s"
x2 <- "06h 09m 57s"
x3 <- "09m 57s"
x4 <- "57s"
I've modified the answer from Jthorpe in this post Convert factor of format Hh Mm Ss to time duration.
days <- as.numeric(gsub('^*([0-9]+)d.*$','\\1',x3))
hours <- as.numeric(gsub('^.*([0-9][0-9])h.*$','\\1',x3))
minutes <- as.numeric(gsub('^.*([0-9][0-9])m.*$','\\1',x4))
seconds <- as.numeric(gsub('^.*([0-9][0-9])s.*$','\\1',x4))
duration_seconds <- seconds + 60*minutes + 60*60*hours + 24*60*60*days
However, this only works with x, not with x1-x4. Now, I know I can probably use if logic to get around the issue, but is there a better way?
Thanks in advance.
We can replace the space characters (\\s+) with + using gsub, replace 'd', 'h', 'm' and 's' with their multipliers using gsubfn, and then loop through the output and evaluate each string.
library(gsubfn)
v2 <- gsubfn("[a-z]", list(d="*24*60*60", h = "*60*60", m = "*60",
s="*1"), gsub("\\s+", "+", v1))
unname(sapply(v2, function(x) eval(parse(text=x))))
#[1] 86890197 58197 22197 597 57
data
v1 <- c(x, x1, x2, x3, x4)
Use:
ifelse(is.na(your_exp), 0, your_exp)
so that whenever your expression evaluates to NA it becomes 0, and otherwise the original value is kept (the third argument is required whenever the test can be FALSE).
Eg:
days <- ifelse(is.na(as.numeric(gsub('^*([0-9]+)d.*$','\\1',x1))), 0, as.numeric(gsub('^*([0-9]+)d.*$','\\1',x1)))
hours <- ifelse(is.na(as.numeric(gsub('^.*([0-9][0-9])h.*$','\\1',x1))), 0, as.numeric(gsub('^.*([0-9][0-9])h.*$','\\1',x1)))
minutes <- ifelse(is.na(as.numeric(gsub('^.*([0-9][0-9])m.*$','\\1',x1))), 0, as.numeric(gsub('^.*([0-9][0-9])m.*$','\\1',x1)))
seconds <- ifelse(is.na(as.numeric(gsub('^.*([0-9][0-9])s.*$','\\1',x1))), 0, as.numeric(gsub('^.*([0-9][0-9])s.*$','\\1',x1)))
Output (after duration_seconds <- seconds + 60*minutes + 60*60*hours + 24*60*60*days):
> duration_seconds
[1] 58197
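For reference, a small sketch wrapping the same idea into a vectorised helper so all of the example strings can be converted at once (the helper name to_seconds is illustrative, and the days pattern is anchored at the start of the string since the days always come first):
to_seconds <- function(x) {
  num <- function(pattern) {
    val <- suppressWarnings(as.numeric(gsub(pattern, "\\1", x)))
    ifelse(is.na(val), 0, val)  # missing components count as 0
  }
  days    <- num("^([0-9]+)d.*$")
  hours   <- num("^.*([0-9][0-9])h.*$")
  minutes <- num("^.*([0-9][0-9])m.*$")
  seconds <- num("^.*([0-9][0-9])s.*$")
  seconds + 60*minutes + 60*60*hours + 24*60*60*days
}
to_seconds(c(x, x1, x2, x3, x4))
# [1] 86890197    58197    22197      597       57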

Select columns of data.table based on regex

How can I select columns of a data.table based on a regex?
Consider a simple example as follows:
library(data.table)
mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4))
Is there a way to select the bar and baz columns from the data.table based on a regex? I know that the following works, but if the table is much bigger and I want to choose more variables, this could easily get cumbersome.
mydt[, .(bar, baz)]
I would like to have something like matches() in dplyr::select() but only by reference.
You can also try %like% from the data.table package, which is a "convenience function for calling regexpr". It also makes the code more readable ;)
In this case, answering your question:
mydt[, .SD, .SDcols = names(mydt) %like% "bar|baz"]
As %like% returns a logical vector, we can use the following to get every column except those which contain "foo":
mydt[, .SD, .SDcols = ! names(mydt) %like% "foo"]
where ! negates the logical vector.
David's answer will work. But if your regex is long and you would rather it be done first, try:
cols <- grep("<regex pattern>", names(mydt), value=T)
mydt[, cols, with=FALSE]
It just depends on your preferences and needs. You can also assign the subsetted table to a chosen variable if you need the original intact.
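For example, keeping mydt intact (mysub is just an illustrative name):
cols <- grep("bar|baz", names(mydt), value = TRUE)
mysub <- mydt[, cols, with = FALSE]  # mydt itself is unchanged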
UPDATE: I updated the comparison with @sindri_baldur's answer, using version 1.12.6. According to the results, patterns() is a handy shortcut, but if performance matters, one should stick with the .. or with = FALSE solution (see below).
Apparently, there is a new way of achieving this from version 1.10.2 onwards.
library(data.table)
cols <- grep("bar|baz", names(mydt), value = TRUE)
mydt[, ..cols]
It seems to work the fastest out of the posted solutions.
# Creating a large data.table with 100k rows, 32 columns
n <- 100000
foo_cols <- paste0("foo", 1:30)
big_dt <- data.table(bar = rnorm(n), baz = rnorm(n))
big_dt[, (foo_cols) := rnorm(n)]
# Methods
subsetting <- function(dt) {
subset(dt, select = grep("bar|baz", names(dt)))
}
usingSD <- function(dt) {
dt[, .SD, .SDcols = names(dt) %like% "bar|baz"]
}
usingWith <- function(dt) {
cols <- grep("bar|baz", names(dt), value = TRUE)
dt[, cols, with = FALSE]
}
usingDotDot <- function(dt) {
cols <- grep("bar|baz", names(dt), value = TRUE)
dt[, ..cols]
}
usingPatterns <- function(dt) {
dt[, .SD, .SDcols = patterns("bar|baz")]
}
# Benchmark
library(microbenchmark)
microbenchmark(
subsetting(big_dt), usingSD(big_dt), usingWith(big_dt), usingDotDot(big_dt), usingPatterns(big_dt),
times = 5000
)
#Unit: microseconds
# expr min lq mean median uq max neval
# subsetting(big_dt) 430 759 1672 1309 1563 82934 5000
# usingSD(big_dt) 547 951 1872 1461 1797 60357 5000
# usingWith(big_dt) 278 496 1331 1112 1304 62656 5000
# usingDotDot(big_dt) 289 483 1392 1117 1344 55878 5000
# usingPatterns(big_dt) 596 1019 1984 1518 1913 120331 5000
Since data.table v1.12.0 (Jan 2019) you can do:
mydt[, .SD, .SDcols = patterns("bar|baz")]
From the official documentation ?data.table, on the .SDcols argument:
[...] you can filter columns to include in .SD based on their names according to regular
expressions via .SDcols=patterns(regex1, regex2, ...). The included
columns will be the intersection of the columns identified by each
pattern; pattern unions can easily be specified with | in a regex. [...] You
can also invert a pattern as usual with .SDcols = !patterns(...).
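For example, to keep everything except the columns whose names match "foo", using the inverted form quoted above:
mydt[, .SD, .SDcols = !patterns("foo")]
#    bar baz
# 1:   2   3
# 2:   3   4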
There is also a subset method for "data.table", so you can always use something like the following:
subset(mydt, select = grep("bar|baz", names(mydt)))
# bar baz
# 1: 2 3
# 2: 3 4
It turns out that creating a startswith type of function for "data.table" is not very straightforward.
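Still, a rough sketch of such a helper is possible by anchoring the prefix in a regex and reusing the subset() idea above (select_startswith is just an illustrative name, and the prefix is treated as a regex):
select_startswith <- function(dt, prefix) {
  subset(dt, select = grep(paste0("^", prefix), names(dt)))
}
select_startswith(mydt, "ba")
#    bar baz
# 1:   2   3
# 2:   3   4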
I suggest this one-liner code for readability and performance.
mydt[,names(mydt) %like% "bar|baz", with=F]
Following @Janosdivenji's answer:
See usingLike on the last row
Unit: microseconds
expr min lq mean median uq max neval
subsetting(big_dt) 370.582 977.2760 1194.875 1016.4340 1096.9285 25750.94 5000
usingSD(big_dt) 554.330 1084.8530 1352.039 1133.4575 1226.9060 189905.39 5000
usingWith(big_dt) 238.481 832.7505 1017.051 866.6515 927.8460 22717.83 5000
usingDotDot(big_dt) 256.005 844.8770 1101.543 878.9935 936.6040 181855.43 5000
usingPatterns(big_dt) 569.787 1128.0970 1411.510 1178.2895 1282.2265 177415.23 5000
usingLike(big_dt) 262.868 852.5805 1059.466 887.3455 948.6665 23971.70 5000

removing zeros from the right and adding some to the left of a number?

Suppose I have the following vector
test<- c(374500, 2270400)
First I want to remove the trailing zeros, to obtain something like:
test2<- c(3745, 22704)
Then, I want to add zeros to the left in order to have 6 digits. That part I know how to do:
test3 <- formatC(test2, width = 6, format = "d", flag = "0")
Can you help me with the first step?
You can use regular expressions:
as.integer(sub("0*$", "", test))
# [1] 3745 22704
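Combining this with the formatC() step from the question then gives the six-digit, zero-padded strings in one go:
formatC(as.integer(sub("0*$", "", test)), width = 6, format = "d", flag = "0")
# [1] "003745" "022704"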
Also, here is a fun one using recursion:
remove_zeroes <- function(x) {
x <- as.integer(x)
i <- x %% 10L == 0
if (any(i)) {
x[i] <- x[i] / 10L
Recall(x)
} else x
}
remove_zeroes(c(123, 1230, 1230000))
# [1] 123 123 123
Benchmarks:
test <- sample.int(1e5)
library(microbenchmark)
microbenchmark(
as.integer(sub("0*$", "", test)),
as.integer(sub("0+$", "", test)),
remove_zeroes(test))
# Unit: milliseconds
# expr min lq median uq max neval
# as.integer(sub("0*$", "", test)) 134.51669 138.91855 141.28812 145.96486 170.93705 100
# as.integer(sub("0+$", "", test)) 113.91206 118.83564 123.42199 126.44162 179.03642 100
# remove_zeroes(test) 38.01125 47.45385 49.79928 54.87592 89.05354 100
Late to the party but here's a qdap approach:
test<- c(374500, 2270400)
library(qdap)
pad(as.numeric(rm_default(test, pattern="0+$")), 6, sort=FALSE)
## [1] "003745" "022704"