Positive look ahead in R - passing variables - regex

I got stuck in a regular expression.
I usually use this line of code to find overlapping repetitions in strings:
gregexpr("(?=ATGGGCT)",text,perl=TRUE)
[[1]]
[1] 16 45 52 75 203 210 266 273 327 364 436 443 480 506 534 570 649
attr(,"match.length")
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
attr(,"useBytes")
[1] TRUE
Now I want to pass gregexpr a pattern stored in a variable:
x="GGC"
Of course, if I write x inside the pattern string, gregexpr searches for the literal character "x" and not for what the variable contains:
gregexpr("(?=x)",text,perl=TRUE)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
How can I pass my variable to gregexpr in this case of positive look ahead?

I'd play with the sprintf function:
x <- "AGA"
text <- "ACAGAGACTTTAGATAGAGAAGA"
gregexpr(sprintf("(?=%s)", x), text, perl=TRUE)
## [[1]]
## [1] 3 5 12 16 18 21
## attr(,"match.length")
## [1] 0 0 0 0 0 0
## attr(,"useBytes")
## [1] TRUE
sprintf replaces the %s placeholder with the value of x.
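One caveat when building the pattern from a variable: if x can contain regex metacharacters (such as . or +), they are interpreted as regex syntax. A minimal sketch, assuming x should be matched literally, is to wrap the placeholder in PCRE's \Q...\E quoting. The sample text and variable names below are made up for illustration, and this does not cover an x that itself contains \E.

```r
x <- "A.C"
text <- "AGCA.CA.C"   # made-up sample string

# Interpolated as-is: "." acts as a wildcard, so "AGC" matches as well
as_regex <- gregexpr(sprintf("(?=%s)", x), text, perl = TRUE)[[1]]

# Quoted with \Q...\E: x is matched character for character
as_literal <- gregexpr(sprintf("(?=\\Q%s\\E)", x), text, perl = TRUE)[[1]]

as.integer(as_regex)    # start positions when x is read as a regex
as.integer(as_literal)  # start positions when x is read literally
```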

You could use paste0, which is shorthand for paste(..., sep = ""):
x <- "GGC"
text <- 'ATGGGCTATGGGCTATGGGCTATGGGCT'
gregexpr(paste0('(?=', x, ')'), text, perl=TRUE)
# [[1]]
# [1] 4 11 18 25
# attr(,"match.length")
# [1] 0 0 0 0
# attr(,"useBytes")
# [1] TRUE
And if you want to access the overlapping matches, take a look at Overlapping matches in R
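A minimal sketch of one way to recover the overlapping matches themselves, assuming x is a plain literal string (no metacharacters) so that every match has length nchar(x): take the start positions from gregexpr and cut the matches out with substring. The sample text is made up.

```r
x <- "GGG"
text <- "ATGGGGCT"   # made-up sample string

# Zero-width lookahead gives the start of every (overlapping) occurrence
starts <- as.integer(gregexpr(paste0("(?=", x, ")"), text, perl = TRUE)[[1]])

# gregexpr reports -1 when nothing matches, so guard against that case
hits <- if (starts[1] == -1L) {
  character(0)
} else {
  substring(text, starts, starts + nchar(x) - 1)
}
hits
```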

The fn$ prefix in gsubfn package supports string interpolation:
library(gsubfn)
# test data
text <- "ATGGGCTAAATGGGCT"
x <- "GGGC"
fn$gregexpr("(?=$x)", text, perl = TRUE)
See ?fn , the gsubfn home page and the gsubfn vignette, vignette("gsubfn") .

OK, I solved it this way:
text="ATGGGCTAAATGGGCT"
x="GGC"
# build the lookahead pattern (named pat to avoid masking base::c)
pat=paste("(?=",x,")",sep="")
r=gregexpr(pat,text,perl=TRUE)

Related

Regex to capture data

I am trying to capture a set of strings using regular expressions.
The strings are of the following format
CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694
The expression that I came up with is
[A-Za-z][a-z0-9A-Z_-]*\s*[0-6]\s*[0-4]\s\s[\s\d]\d\s*[0-9,]*\s*[0-9,]*\s*[0-9,]*\s*[0-9,]*\s*0\s*[0-9,]*
Although this expression works for me and gives the necessary output, I feel that it is not optimized.
Can someone help me optimize the expression?
<?php
$text = 'CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694';
preg_match("/^[A-Z]+\s+[A-Za-z_]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9,]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9,]+$/", $text, $m);
print_r($m);
preg_match("/^([A-Z]+)\s+([A-Za-z_]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9,]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9,]+)$/", $text, $m);
print_r($m);
/*
Output:
Array
(
[0] => CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694
)
Array
(
[0] => CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694
[1] => CID
[2] => _At
[3] => 1
[4] => 2
[5] => 99
[6] => 1,198,498,377
[7] => 414
[8] => 0
[9] => 0
[10] => 0
[11] => 3,694
)
*/
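Replace each * with + so that every field must contain at least one character, and, if you need the individual fields, group them with () as in the second preg_match call above.
If you need a fixed length, replace "+" with "{1}" (exactly 1 character).
If you need a minimum of 1 and a maximum of 3, use {1,3}.
If you need a minimum of 1 and no upper limit, use {1,}.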
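The same quantifier syntax can be tried out quickly in R, which is the language used on the rest of this page. This is only an illustrative sketch; the sample fields and variable names are made up.

```r
fields <- c("7", "123", "1234", "")   # made-up sample fields

exactly_one  <- grepl("^[0-9]{1}$",   fields)  # exactly one digit
one_to_three <- grepl("^[0-9]{1,3}$", fields)  # between one and three digits
one_or_more  <- grepl("^[0-9]{1,}$",  fields)  # at least one digit, same as +

data.frame(fields, exactly_one, one_to_three, one_or_more)
```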

How to remove non-alphabetic characters and convert all letters to lowercase in R?

In the following string:
"I may opt for a yam for Amy, May, and Tommy."
How can I remove non-alphabetic characters, convert all letters to lowercase, and sort the letters within each word in R?
I am also trying to sort the words in the sentence and remove the duplicates.
You could use stringi:
library(stringi)
txt <- "I may opt for a yam for Amy, May, and Tommy."
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))
Which gives:
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
Update
As mentioned by @DavidArenburg, I overlooked the "sort the letters within words" part of your question. You didn't provide a desired output and no immediate application comes to mind but, assuming you want to identify which words have a matching counterpart (string distance of 0):
library(stringdist) # for stringdistmatrix
library(magrittr) # for %>%
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%
# a amy and for i may opt tommy yam
# a 0 2 2 4 2 2 4 6 2
# amy 2 0 4 6 4 0 6 4 0
# and 2 4 0 6 4 4 6 8 4
# for 4 6 6 0 4 6 4 6 6
# i 2 4 4 4 0 4 4 6 4
# may 2 0 4 6 4 0 6 4 0
# opt 4 6 6 4 4 6 0 4 6
# tommy 6 4 8 6 6 4 4 0 4
# yam 2 0 4 6 4 0 6 4 0
apply(., 1, function(x) sum(x == 0, na.rm=TRUE))
# a amy and for i may opt tommy yam
# 1 3 1 1 1 3 1 1 3
Words with more than one 0 per row ("amy", "may", "yam") have a scrambled counterpart.
str <- "I may opt for a yam for Amy, May, and Tommy."
## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]
## split the words into characters and sort them
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))))
## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")
# i may opt for a yam for amy may and
# "i" "amy" "opt" "for" "a" "amy" "for" "amy" "amy" "adn"
# tommy
# "mmoty"
## If you want to convert result back to string
do.call(paste, lapply(sortedWords, paste, collapse=""))
# [1] "i amy opt for a amy for amy amy adn mmoty"
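Building on the sorted-letters idea above, here is a minimal base R sketch that uses each word's sorted letters as a key to group words that are rearrangements of one another. The variable names (key, groups) are made up for illustration.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Keep letters and spaces only, lowercase, split into unique words
words <- unique(strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +")[[1]])

# Key for each word: its letters in sorted order, pasted back together
key <- vapply(strsplit(words, ""),
              function(ch) paste(sort(ch), collapse = ""),
              character(1))

# Words sharing a key are made of the same letters
groups <- split(words, key)
groups[lengths(groups) > 1]
```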
stringr will let you work on all character sets in R and at C-speed, and magrittr will let you use a piping idiom that works well for your needs:
library(stringr)
library(magrittr)
txt <- "I may opt for a yam for Amy, May, and Tommy."
txt %>%
str_to_lower %>% # lowercase
str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha
str_replace_all("[[:space:]]+", " ") %>% # single spaces
str_split(" ") %>% # tokenize
extract2(1) %>% # str_split returns a list
sort %>% # sort
unique # unique words
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
The qdap package that I maintain has the bag_o_words function that works well for this:
txt <- "I may opt for a yam for Amy, May, and Tommy."
library(qdap)
unique(sort(bag_o_words(txt)))
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"

Extract numbers from strings including '|'

I have data where some of the items are numbers separated by "|", like:
head(mintimes)
[1] "3121|3151" "1171" "1351|1381" "1050" "" "122"
head(minvalues)
[1] 14 10 11 31 Inf 22
What I would like to do is extract all the times and match them to the minvalues. To end up with something like:
times values
3121 14
3151 14
1171 10
1351 11
1381 11
1050 31
122 22
I've tried to strsplit(mintimes, "|") and I've tried str_extract(mintimes, "[0-9]+") but they don't seem to work. Any ideas?
| is a regular expression metacharacter. When used literally, these special characters need to be escaped either with [] or with \\ (or you could use fixed = TRUE in some functions). So your call to strsplit() should be
strsplit(mintimes, "[|]")
or
strsplit(mintimes, "\\|")
or
strsplit(mintimes, "|", fixed = TRUE)
Regarding your other try with stringr functions, str_extract_all() seems to do the trick.
library(stringr)
str_extract_all(mintimes, "[0-9]+")
To get your desired result,
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
s <- strsplit(mintimes, "[|]")
data.frame(times = as.numeric(unlist(s)),
           values = rep(minvalues, sapply(s, length)))
# times values
# 1 3121 14
# 2 3151 14
# 3 1171 10
# 4 1351 11
# 5 1381 11
# 6 1050 31
# 7 122 22
By default strsplit splits using a regular expression and "|" is a special character in the regular expression syntax. You can either escape it
strsplit(mintimes,"\\|")
or just set fixed=TRUE to not use regular expressions
strsplit(mintimes,"|", fixed=TRUE)
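A small sketch comparing the variants side by side on a shortened, made-up version of the input. The escaped, bracketed and fixed forms should all agree, while the unescaped call treats "|" as regex alternation rather than a literal bar and therefore splits differently.

```r
mintimes <- c("3121|3151", "1171", "")   # shortened sample input

escaped <- strsplit(mintimes, "\\|")               # backslash-escaped
bracket <- strsplit(mintimes, "[|]")               # inside a character class
literal <- strsplit(mintimes, "|", fixed = TRUE)   # no regex at all

# Unescaped: "|" is interpreted as regex syntax, not as a literal bar
unescaped <- strsplit(mintimes, "|")

identical(escaped, bracket)
identical(escaped, literal)
escaped[[1]]
unescaped[[1]]
```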
I have written a function called cSplit that is useful for these types of things. You can get it from my Gist: https://gist.github.com/mrdwab/11380733
Usage would be:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "long")
# mintimes minvalues
# 1: 3121 14
# 2: 3151 14
# 3: 1171 10
# 4: 1351 11
# 5: 1381 11
# 6: 1050 31
# 7: 122 22
It also has a "wide" setting, in case that would be at all useful to you:
cSplit(data.table(mintimes, minvalues), "mintimes", "|", "wide")
# minvalues mintimes_1 mintimes_2
# 1: 14 3121 3151
# 2: 10 1171 NA
# 3: 11 1351 1381
# 4: 31 1050 NA
# 5: Inf NA NA
# 6: 22 122 NA
Note: The output is a data.table.
As others have mentioned, you need to escape the | to include it literally in a regular expression. As always, we can skin this cat many ways, and here's one way to do it with stringr:
x <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
library(stringr)
unlist(str_extract_all(x, "\\d+"))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
This won't work as expected if you have any decimal points in a character string of numbers, so the following (which says to match anything but |) might be safer:
unlist(str_extract_all(x, '[^|]+'))
# [1] "3121" "3151" "1171" "1351" "1381" "1050" "122"
Either way, you might want to wrap the result in as.numeric.
And here's another solution using stri_split_fixed from the stringi package. As an added value, we also play with mapply and do.call.
Input data:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
Split mintimes w.r.t. | and convert to numeric:
library("stringi")
mintimes <- lapply(stri_split_fixed(mintimes, "|"), as.numeric)
## [[1]]
## [1] 3121 3151
##
## [[2]]
## [1] 1171
##
## [[3]]
## [1] 1351 1381
##
## [[4]]
## [1] 1050
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] 122
Column-bind each minvalues with corresponding mintimes:
tmp <- mapply(cbind, mintimes, minvalues)
## [[1]]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
##
## [[2]]
## [,1] [,2]
## [1,] 1171 10
##
## [[3]]
## [,1] [,2]
## [1,] 1351 11
## [2,] 1381 11
##
## [[4]]
## [,1] [,2]
## [1,] 1050 31
##
## [[5]]
## [,1] [,2]
## [1,] NA Inf
##
## [[6]]
## [,1] [,2]
## [1,] 122 22
Row-bind all the 6 matrices & remove NA-rows:
res <- do.call(rbind, tmp)
res[!is.na(res[,1]),]
## [,1] [,2]
## [1,] 3121 14
## [2,] 3151 14
## [3,] 1171 10
## [4,] 1351 11
## [5,] 1381 11
## [6,] 1050 31
## [7,] 122 22
To get the output you want, try something like this:
library(dplyr)
Split.Times <- function(x) {
mintimes <- as.numeric(unlist(strsplit(as.character(x$mintimes), "\\|")))
return(data.frame(mintimes = mintimes, minvalues = x$minvalues, stringsAsFactors=FALSE))
}
df <- data.frame(mintimes, minvalues, stringsAsFactors=FALSE)
df %>%
filter(mintimes != "") %>%
group_by(mintimes) %>%
do(Split.Times(.))
This produces:
mintimes minvalues
1 1050 31
2 1171 10
3 122 22
4 1351 11
5 1381 11
6 3121 14
7 3151 14
(I borrowed from my answer here - which is pretty much the same question/problem)
Here's a qdap package approach:
mintimes <- c("3121|3151", "1171", "1351|1381", "1050", "", "122")
minvalues <- c(14, 10, 11, 31, Inf, 22)
library(qdap)
list2df(setNames(strsplit(mintimes, "\\|"), minvalues), "times", "values")
## times values
## 1 3121 14
## 2 3151 14
## 3 1171 10
## 4 1351 11
## 5 1381 11
## 6 1050 31
## 7 122 22
You can also split on the POSIX class [:punct:], which matches | along with any other punctuation character:
strsplit(mintimes, "[[:punct:]]")

split strings on first and last commas

I would like to split strings on the first and last comma. Each string has at least two commas. Below is an example data set and the desired result.
A similar question here asked how to split on the first comma: Split on first comma in string
Here I asked how to split strings on the first two colons: Split string on first two colons
Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
desired.result <- read.table(text='
my.string1 my.string2 my.string3 some.data
123 34,56,78 90 10
87 65,43 21 20
a4 b6 c8888 30
11 bbbb ccccc 40
uu vv,ww xx 50
j k,l,m,n,o p 60', header = TRUE, stringsAsFactors=FALSE)
You can use the \K operator, which keeps text already matched out of the result, and a positive lookahead assertion to do this (well, almost: there is an annoying comma at the start of the middle portion which I have yet to get rid of in the strsplit). But I enjoyed this as an exercise in constructing a regex...
x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123" ",34,56,78" "90"
Explanation:
^[^,]+ : from the start of the string match one or more characters that are not a ,
\\K : but don't include those matched characters in the match
So the first match is the zero-width position just before the first comma (which is why that comma stays attached to the middle piece)...
| : or you can match...
,(?=[^,]+$) : a , so long as it is followed by [(?=...)] one or more characters that are not a , until the end of the string ($)...
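If the leftover comma is a problem, one alternative is to capture the three pieces directly instead of splitting. This is a minimal base R sketch using regexec and regmatches, not the strsplit approach above; the pattern and variable names are assumptions made for illustration, and it relies on each string containing at least two commas.

```r
x <- c("123,34,56,78,90", "a4,b6,c8888")

# Group 1: text before the first comma
# Group 2: everything between the first and last comma (greedy)
# Group 3: text after the last comma
pattern <- "^([^,]*),(.*),([^,]*)$"

parts <- regmatches(x, regexec(pattern, x))

# Drop the full match (first element) and stack the three groups row-wise
pieces <- do.call(rbind, lapply(parts, function(p) p[-1]))
pieces
```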
Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons producing s. Then we read s using sep=";" and finally cbind the rest of my.data to it:
s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])
giving:
mystring1 mystring2 mystring3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Here is code to split on the first and last comma. This code draws heavily from an answer by @bdemarest here: Split string on first two colons. The gsub pattern below, which is the meat of the answer, contains important differences. The code for creating the new data frame after the strings are split is the same as that of @bdemarest.
# Replace first and last commas with colons.
new.string <- gsub(pattern="(^[^,]+),(.+),([^,]+$)",
replacement="\\1:\\2:\\3", x=my.data$my.string)
new.string
# Split on colons
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 my.string3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60
# Here is code for splitting strings on the first comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace first comma with colon
new.string <- gsub(pattern="(^[^,]+),(.+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123 34,56,78,90 10
# 2 87 65,43,21 20
# 3 a4 b6,c8888 30
# 4 11 bbbb,ccccc 40
# 5 uu vv,ww,xx 50
# 6 j k,l,m,n,o,p 60
# Here is code for splitting strings on the last comma
my.data <- read.table(text='
my.string some.data
123,34,56,78,90 10
87,65,43,21 20
a4,b6,c8888 30
11,bbbb,ccccc 40
uu,vv,ww,xx 50
j,k,l,m,n,o,p 60', header = TRUE, stringsAsFactors=FALSE)
# Replace last comma with colon
new.string <- gsub(pattern="^(.+),([^,]+$)",
replacement="\\1:\\2", x=my.data$my.string)
new.string
# Split on colon
split.data <- strsplit(new.string, ":")
# Create new data frame
new.data <- data.frame(do.call(rbind, split.data))
names(new.data) <- paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string <- NULL
my.data <- cbind(new.data, my.data)
my.data
# my.string1 my.string2 some.data
# 1 123,34,56,78 90 10
# 2 87,65,43 21 20
# 3 a4,b6 c8888 30
# 4 11,bbbb ccccc 40
# 5 uu,vv,ww xx 50
# 6 j,k,l,m,n,o p 60
You can do a simple strsplit here on that column
popshift<-sapply(strsplit(my.data$my.string,","), function(x)
c(x[1], paste(x[2:(length(x)-1)],collapse=","), x[length(x)]))
desired.result <- cbind(data.frame(my.string=t(popshift)), my.data[-1])
I just split up all the values and make a new vector with the first, middle and last strings. Then I cbind that with the rest of the data. The result is
my.string.1 my.string.2 my.string.3 some.data
1 123 34,56,78 90 10
2 87 65,43 21 20
3 a4 b6 c8888 30
4 11 bbbb ccccc 40
5 uu vv,ww xx 50
6 j k,l,m,n,o p 60
Using str_match() from package stringr, and a little help from one of your links,
library(stringr)
data.frame(str_match(my.data$my.string, "(.+?),(.*),(.+?)$")[,-1],
           some.data = my.data$some.data)
# X1 X2 X3 some.data
# 1 123 34,56,78 90 10
# 2 87 65,43 21 20
# 3 a4 b6 c8888 30
# 4 11 bbbb ccccc 40
# 5 uu vv,ww xx 50
# 6 j k,l,m,n,o p 60

R list function

I have a problem applying a function to list elements. I have a list called "mylist", which looks like:
[[1]] station global
1 2
1 2
1 2
1 14
1 38
1 169
[[2]] station global
2 2
2 2
2 23
2 86
In each list, I need to set values of "global" less than or equal to 2 to NA.
I have used
dat.list <- lapply(mylist, `[[`, 'global')
to get only the global data.
Defining a function:
fct <- function(x) {
x[x <= 2] <- NA
}
and writing
lapply(dat.list, fct)
gives
[[1]] NA
[[2]] NA
What I would like to have is:
[[1]] station global
1 NA
1 NA
1 NA
1 14
1 38
1 169
[[2]] station global
2 NA
2 NA
2 23
2 86
I appreciate any help or a point in the right direction. Regards, Sisse
It would help if you posted a reproducible example. See here for advice on how to do this.
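x will take on the element of the list. Your function returns NA because its last expression is the assignment, whose value is the right-hand side (NA); the function has to return the modified object explicitly. Since the list elements appear to be data.frames, apply the function to mylist itself and treat x as a data.frame: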
fct <- function(x) {
x$global[x$global <= 2] <- NA
x
}
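A short usage sketch, assuming mylist is a list of data.frames with the columns shown in the question (the list is rebuilt here by hand, since no reproducible example was posted):

```r
# Hand-built stand-in for the list in the question
mylist <- list(
  data.frame(station = 1, global = c(2, 2, 2, 14, 38, 169)),
  data.frame(station = 2, global = c(2, 2, 23, 86))
)

fct <- function(x) {
  x$global[x$global <= 2] <- NA  # blank out values of 2 or less
  x                              # return the whole modified data.frame
}

# Apply to the list of data.frames, not to the extracted global vectors
result <- lapply(mylist, fct)
result
```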