Splitting column of a data.frame in R using gsub - regex

I have a data.frame called rbp that contains a single column like the following:
>rbp
V1
dd_smadV1_39992_0_1
Protein: AGBT(Dm)
Sequence Position
234
290
567
126
Protein: ATF1(Dm)
Sequence Position
534
890
105
34
128
301
Protein: Pox(Dm)
201
875
453
*********************
dd_smadv1_9_02
Protein: foxc2(Mm)
Sequence Position
145
987
345
907
Protein: Lor(Hs)
876
512
I would like to discard the sequence positions and extract only the sequence names and the corresponding protein names, like the following:
dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)
dd_smadv1_9_02 foxc2(Mm);Lor(Hs)
I tried the following code in R but it failed:
library(gsubfn)
Sub(rbp$V1,"Protein:(.*?) ")
Could anyone guide me please.

Here's one way to do it:
m <- gregexpr("Protein: (.*?)\n", x <- strsplit(paste(rbp$V1, collapse = "\n"), "*********************", fixed = TRUE)[[1]])
proteins <- lapply(regmatches(x, m), function(x) sub("Protein: (.*)\n", "\\1", x))
names <- sub(".*?([A-Za-z0-9_]+)\n.*", "\\1", x)
sprintf("%s %s", names, sapply(proteins, paste, collapse = ";"))
# [1] "dd_smadV1_39992_0_1 AGBT(Dm);ATF1(Dm);Pox(Dm)"
# [2] "dd_smadv1_9_02 foxc2(Mm);Lor(Hs)"
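The same grouping can also be sketched without pasting everything into one string. This is only an illustration on a shortened toy copy of the column; it assumes each record starts with its ID line, records are separated by a line of asterisks, and every record has at least one "Protein:" line. The names v, grp, ids and prots are made up for the example.

```r
# Shortened toy version of rbp$V1 (assumed layout, see above)
v <- c("dd_smadV1_39992_0_1", "Protein: AGBT(Dm)", "Sequence Position", "234",
       "Protein: ATF1(Dm)", "Sequence Position", "534",
       "*********************",
       "dd_smadv1_9_02", "Protein: foxc2(Mm)", "145", "Protein: Lor(Hs)", "876")

is_sep  <- grepl("^\\*+$", v)        # separator lines
is_prot <- grepl("^Protein: ", v)    # protein lines
grp     <- cumsum(is_sep)            # record number: 0, 0, ..., 1, 1, ...

# First non-separator line of each record is its ID
ids <- tapply(v[!is_sep], grp[!is_sep], function(s) s[1])

# Protein names per record, joined with ";"
prots <- tapply(sub("^Protein: ", "", v[is_prot]), grp[is_prot],
                paste, collapse = ";")

res <- paste(ids, prots)
res
```

Because the records are numbered with cumsum, the approach does not depend on how many position lines follow each protein.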

Extracting position of pattern in a string using ifelse in R

I have a set of strings x for example:
[1] "0000000000000000000000000000000000000Y" "9000000000D00000000000000000000Y"
[3] "0000000000000D00000000000000000000X" "000000000000000000D00000000000000000000Y"
[5] "000000000000000000D00000000000000000000Y" "000000000000000000D00000000000000000000Y"
[7] "000000000000000000000000D0000000011011D1X"
I want to extract the last position of a particular character like 1. I am running this code:
ifelse(grepl("1",x),rev(gregexpr("1",x)[[1]])[1],50)
But this is returning -1 for all elements. How do I correct this?
We can use stri_locate_last from stringi. If there are no matches, it will return NA.
library(stringi)
r1 <- stri_locate_last(v1, fixed=1)[,1]
r1
#[1] NA NA NA NA NA NA 40
nchar(v1)
#[1] 38 32 35 40 40 40 41
If we need to replace the NA values with number of characters
ifelse(is.na(r1), nchar(v1), r1)
data
v1 <- c("0000000000000000000000000000000000000Y",
"9000000000D00000000000000000000Y",
"0000000000000D00000000000000000000X",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000000000D0000000011011D1X")
In base R, the following returns the position of the last matched "1".
# Make some toy data
toydata <- c("001", "007", "00101111Y", "000AAAYY")
# Find last position
last_pos <- sapply(gregexpr("1", toydata), function(m) m[length(m)])
print(last_pos)
#[1] 3 -1 8 -1
It returns -1 whenever the pattern is not matched.
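The question's attempt indexes gregexpr("1", x)[[1]], which only looks at the first string, so the same value is recycled for every element. A minimal base R sketch that handles each string separately and then substitutes the question's fallback value of 50 might look like this (the short toy vector is an assumption, not the original data):

```r
# Toy strings: the first two contain no "1", the last one does
v1 <- c("0000Y", "90D0Y", "00D011011D1X")

# One match vector per string; take the last match position of each
last_pos <- vapply(gregexpr("1", v1, fixed = TRUE),
                   function(m) as.integer(m[length(m)]),
                   integer(1))

# gregexpr reports -1 when there is no match; replace it with the fallback
last_pos[last_pos == -1L] <- 50L
last_pos
#[1] 50 50 11
```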

R setdiff() by regex

Is it possible to customize setdiff using regular expressions to see what is in one vector and not another? For example:
x <- c("1\t119\t120\t1\t119\t120\tABC\tDEF\t0", "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0")
[1] "1\t119\t120\t1\t119\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t558\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t139\t141\tMNO\tPQR\t0"
[4] "3\t139\t143\t3\t139\t143\tSTU\tVWX\t0"
y <- c("1\t119\t120\t1\t109\t120\tABC\tDEF\t0", "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0", "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0", "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0", "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0", "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0")
[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0"
[2] "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0"
[4] "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
I want to be able to show that:
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0"
[6] "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
are new because 4\t157\t158 and 5\t158\t159 are unique to y. This doesn't work:
> setdiff(y,x)
[1] "1\t119\t120\t1\t109\t120\tABC\tDEF\t0" "2\t558\t559\t2\t548\t559\tGHI\tJKL\t0"
[3] "3\t139\t141\t3\t129\t141\tMNO\tPQR\t0" "3\t139\t143\t3\t129\t143\tSTU\tVWX\t0"
[5] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
That is because column 5 differs between x and y. I want setdiff to compare only the first three columns.
A simple example of setdiff can be found here: How to tell what is in one vector and not another?
One way to do this is to put x and y as data.frames and anti-join. I'll use data.table since I find it more natural.
library(data.table)
xDT <- as.data.table(do.call("rbind", strsplit(x, split = "\t")))
yDT <- as.data.table(do.call("rbind", strsplit(y, split = "\t")))
Now anti-join (a "setdiff" for data.frames/data.tables):
yDT[!xDT, on = paste0("V", 1:3)]
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1: 4 157 158 4 147 158 XWX YTY 0
# 2: 5 158 159 5 148 159 PHP WZW 0
You could also get the row index (thanks to @Frank for the suggested improvement/simplification):
> yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]
Or extract it directly from y:
> y[yDT[!xDT, which = TRUE, on = paste0("V", 1:3)]]
# [1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
We could also use anti_join from dplyr after reading the strings with fread:
library(data.table)
library(dplyr)
anti_join(fread(paste(y, collapse='\n')),
fread(paste(x, collapse='\n')), by = c('V1', 'V2', 'V3'))
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# (int) (int) (int) (int) (int) (int) (chr) (chr) (int)
# 1 4 157 158 4 147 158 XWX YTY 0
# 2 5 158 159 5 148 159 PHP WZW 0
Or (as the title requests for regex) we can use regex to remove part of the string and then do the %in%
y[!sub('(([^\t]+\t){3}).*', '\\1', y) %in%
sub('(([^\t]+\t){3}).*', '\\1', x)]
#[1] "4\t157\t158\t4\t147\t158\tXWX\tYTY\t0" "5\t158\t159\t5\t148\t159\tPHP\tWZW\t0"
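The key-based comparison can also be written with strsplit instead of a regex. The helper name first_three below is hypothetical, and the shortened toy vectors are assumptions for illustration only:

```r
# Build a comparison key from the first three tab-separated fields
first_three <- function(s) {
  vapply(strsplit(s, "\t", fixed = TRUE),
         function(p) paste(p[1:3], collapse = "\t"),
         character(1))
}

x <- c("1\t119\t120\t1\t119", "2\t558\t559\t2\t558")
y <- c("1\t119\t120\t1\t109", "4\t157\t158\t4\t147")

# Keep the elements of y whose key does not occur among the keys of x
new_rows <- y[!first_three(y) %in% first_three(x)]
new_rows
#[1] "4\t157\t158\t4\t147"
```

Changing 1:3 inside the helper would compare on a different set of columns.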

Converting numbers into time in R

My data looks like this:
> str(m)
int [1:8407] 930 1050 1225 1415 1620 1840 820 1020 1215 1410 ...
These numbers are times in hours and minutes. I'm trying to turn them into something like (9:30, 12:10, 16:40, 8:25...).
> m1 <- strptime(m, "%H%M")
> head(m1)
[1] NA "2015-10-14 10:50:00 VLAT"
[3] "2015-10-14 12:25:00 VLAT" "2015-10-14 14:15:00 VLAT"
[5] "2015-10-14 16:20:00 VLAT" "2015-10-14 18:40:00 VLAT"
> str(m1)
POSIXlt[1:8407], format: NA "2015-10-14 10:50:00" "2015-10-14 12:25:00" ...
How can I convert these numbers into times?
Using regex:
sub("(\\d{2})$", ":\\1", x)
#[1] "9:30" "10:50" "12:25" "14:15" "16:20" "18:40" "8:20"
#[8] "10:20" "12:15" "14:10"
The pattern matches the last two digits and inserts a colon before them.
Data
x <- c(930, 1050, 1225, 1415, 1620, 1840, 820, 1020, 1215, 1410)
We pad three-digit numbers with a leading 0 using sprintf, parse the result with strptime, and then use format to get the hours and minutes.
format(strptime(sprintf('%04d', v1), format='%H%M'), '%H:%M')
#[1] "09:30" "10:50" "12:25"
Or another option is
sub('(\\d{2})$', ':\\1', v1)
#[1] "9:30" "10:50" "12:25"
data
v1 <- c(930, 1050,1225)
Another way is,
x <- c(645, 1234,2130)
substr(as.POSIXct(sprintf("%04.0f", x), format='%H%M'), 12, 16)
#[1] "06:45" "12:34" "21:30"
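Since the numbers encode hours and minutes as HHMM, the two parts can also be separated arithmetically, without regex or date parsing. A small sketch, assuming whole-number inputs between 0 and 2359 (the toy vector m is made up for the example):

```r
m <- c(930L, 1050L, 45L)

# Integer division gives the hours, the remainder gives the minutes
hhmm <- sprintf("%02d:%02d", m %/% 100L, m %% 100L)
hhmm
#[1] "09:30" "10:50" "00:45"
```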

R + converting a integer to a hh:mm format using regex + gsub

interval is a subset of 5 minute intervals for a 25 hour period
> interval
[1] 45 50 55 100 105 110 115 120 125 130 135 2035 2040 2045 2050 2055 2100 2105 2110 2115 2120 2125
I want to insert : to put it in a format that I can convert to a time
> gsub('^([0-9]{1,2})([0-9]{2})$', '\\1:\\2', interval)
[1] "45" "50" "55" "1:00" "1:05" "1:10" "1:15" "1:20" "1:25" "1:30" "1:35" "20:35" "20:40" "20:45"
[15] "20:50" "20:55" "21:00" "21:05" "21:10" "21:15" "21:20" "21:25"
I have got it working for nearly all my examples.
How do I get it to work on the numbers "5" ... "45" "50" "55"?
Found this duplicate here but this does not use gsub
An easy way to do this would be to make sure all the inputs have at least 4 characters:
gsub('^([0-9]{1,2})([0-9]{2})$', '\\1:\\2', sprintf('%04d',interval))
# "00:45" "00:50" "00:55" "01:00" "01:05" "01:10" "01:15" "01:20" "01:25"
# "01:30" "01:35" "20:35" "20:40" "20:45" "20:50" "20:55" "21:00" "21:05"
# "21:10" "21:15" "21:20" "21:25"
Using sub:
> sub('..\\K', ':', sprintf('%04d', interval), perl = TRUE)
# [1] "00:45" "00:50" "00:55" "01:00" "01:05" "01:10" "01:15" "01:20" "01:25"
# [10] "01:30" "01:35" "20:35" "20:40" "20:45" "20:50" "20:55" "21:00" "21:05"
# [19] "21:10" "21:15" "21:20" "21:25"
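If the padded strings are only an intermediate step towards something numeric, the minutes since midnight can be computed from the same integers. This is a sketch on a few made-up values, assuming the inputs are whole numbers in HHMM form:

```r
interval <- c(5L, 45L, 100L, 2035L)

# Zero-pad to four digits, then split into hours and minutes
hhmm <- sub("^(\\d{2})(\\d{2})$", "\\1:\\2", sprintf("%04d", interval))
hhmm
# [1] "00:05" "00:45" "01:00" "20:35"

# Minutes since midnight, computed without any string handling
mins <- (interval %/% 100L) * 60L + interval %% 100L
mins
# [1]    5   45   60 1235
```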

Replace the first N dots of a string revisited

In January I asked how to replace the first N dots of a string: replace the first N dots of a string
DWin's answer was very helpful. Can it be generalized?
df.1 <- read.table(text = '
my.string other.stuff
1111111111111111 120
..............11 220
11.............. 320
1............... 320
.......1........ 420
................ 820
11111111111111.1 120
', header = TRUE)
nn <- 14
# this works:
df.1$my.string <- sub("^\\.{14}", paste(as.character(rep(0, nn)), collapse = ""),
df.1$my.string)
# this does not work:
df.1$my.string <- sub("^\\.{nn}", paste(as.character(rep(0, nn)), collapse = ""),
df.1$my.string)
Using sprintf you can have the desired output
nn <- 3
sub(sprintf("^\\.{%s}", nn),
paste(rep(0, nn), collapse = ""), df.1$my.string)
## [1] "1111111111111111" "000...........11" "11.............."
## [4] "1..............." "000....1........" "000............."
## [7] "11111111111111.1"
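The pattern "^\\.{nn}" in the question fails because nn inside a string literal is just the two letters, not the variable; the number has to be pasted into the pattern. One way to package this is a small wrapper. The function name replace_leading_dots is hypothetical, and strrep requires R 3.3.0 or later:

```r
# Replace exactly n leading dots with n zeros; strings that do not
# start with at least n dots are returned unchanged
replace_leading_dots <- function(s, n) {
  n <- as.integer(n)
  sub(sprintf("^\\.{%d}", n), strrep("0", n), s)
}

out <- replace_leading_dots(c("..............11",
                              "11..............",
                              "................"), 14)
out
## [1] "0000000000000011" "11.............." "00000000000000.."
```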
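Another option is to build the pattern by repeating an escaped dot nn times and anchoring it at the start of the string (starting from the original df.1):
nn <- 14
# "^" anchors the match; each "\\." matches one literal dot
pattstr <- paste0("^", paste0(rep("\\.", nn), collapse = ""))
df.1$my.string <- sub(pattstr,
paste0(rep("0", nn), collapse = ""),
df.1$my.string)
> df.1
my.string other.stuff
1 1111111111111111 120
2 0000000000000011 220
3 11.............. 320
4 1............... 320
5 .......1........ 420
6 00000000000000.. 820
7 11111111111111.1 120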