Getting distance between two words in R - regex

Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating if the word know is within 6 words of help.

This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
  x <- strsplit(string, " ")[[1]]
  abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
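One possible way to deal with the partial-match problem mentioned above is to compare whole words instead of using grep. The following is only a sketch: the name withinRangeExact, the decision to strip leading and trailing punctuation, and the choice to use the closest pair of occurrences are all assumptions, not part of the original answer.

```r
# Hypothetical variant of withinRange() that matches whole words only.
withinRangeExact <- function(string, term1, term2, threshold = 6) {
  # Split on runs of whitespace
  words <- strsplit(string, "\\s+")[[1]]
  # Strip leading/trailing punctuation so "along." compares equal to "along"
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  pos1 <- which(words == term1)
  pos2 <- which(words == term2)
  # If either term is absent there is no distance to report
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
  # Use the closest pair of occurrences
  min(abs(outer(pos1, pos2, "-"))) <= threshold
}

string <- "thanks so much for your help all along. i'll let you know when...."
withinRangeExact(string, "help", "know")
# [1] TRUE
withinRangeExact(string, "you", "know")
# [1] TRUE
```

Here "you" no longer matches "your", so a single logical value is returned instead of one value per partial match.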

You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to find the positions of your two terms in the resulting array and taking the difference of the indexes (array positions).
Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:
\bhelp(\s+\S+){0,5}\s+know\b
This uses the same "space is the delimiter" concept. It first matches help, then greedily up to 5 occurrences of " word", then looks for " know" (so "know" is at most the 6th word after "help"). Note that it only matches when help comes before know.
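To try this kind of pattern from R, something along these lines might work. It is a sketch only: the helper name followsWithin and its gap parameter are made up for illustration, it uses the {0,5} form of the quantifier, and it assumes the terms contain no regex metacharacters.

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Hypothetical helper: TRUE if term2 follows term1 with at most `gap`
# whitespace-delimited words in between.
followsWithin <- function(string, term1, term2, gap = 5) {
  pattern <- sprintf("\\b%s(\\s+\\S+){0,%d}\\s+%s\\b", term1, gap, term2)
  grepl(pattern, string, perl = TRUE)
}

followsWithin(string, "help", "know")
# [1] TRUE
followsWithin(string, "thanks", "know")
# [1] FALSE
followsWithin(string, "know", "help")   # order matters
# [1] FALSE
```

Because the pattern is directional, checking both orders requires calling the helper twice with the terms swapped.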

Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an index vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT: In a function
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT: Plus
There is a great advantage in indexing words: once it's done, you can compute many statistics on a text.
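As a rough illustration of what an index of words makes possible, the sketch below stores every position of every word (not just the first) and derives word counts from the same split. The variable names positions and word_counts are assumptions chosen for this example.

```r
string <- "thanks so much for your help all along. i'll let you know when...."
words <- strsplit(string, "\\s+")[[1]]

# Map each distinct word to all of the positions where it occurs
positions <- split(seq_along(words), words)
positions[["help"]]
# [1] 6
positions[["know"]]
# [1] 12

# Word frequencies come from the same split
word_counts <- table(words)

# Distance between the two terms
abs(positions[["help"]] - positions[["know"]])
# [1] 6
```

Keeping all positions per word avoids silently using only the first occurrence when a term is repeated.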

Related

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id that looks something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\.", df$id), 99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit, using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
# [1] "23-45-6A" "1B" "A-A"
You could remove the characters including the . using
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
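For completeness, the position-based approach from the question (an "indexOf" followed by a substring) can also be written directly in base R. This is a minimal sketch that assumes each value contains at least one period; regexpr returns -1 when there is no match, in which case substring() returns the whole string.

```r
x <- c("13.23-45-6A", "0.1B", "4.A-A")

# Position of the first literal "." in each element (fixed = TRUE avoids regex escaping)
pos <- as.integer(regexpr(".", x, fixed = TRUE))
pos
# [1] 3 2 2

# Keep everything after that position
after <- substring(x, pos + 1)
after
# [1] "23-45-6A" "1B"       "A-A"
```

On a data frame the same two lines would be applied to the column, e.g. with df$id in place of x.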

Sequentially replace multiple places matching single pattern in a string with different replacements

Using stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
hello,world??your,make|[]world,hello,pos
with a different replacement, e.g. increasing numbers:
1,2??3,4|[]5,6,7
Note that simple separators cannot be assumed, the practical use case is more complicated.
stringr::str_replace_all does not seem to work here. A call such as
str_replace_all(x, "(\\w+)", 1:7)
produces a vector in which each replacement is applied to all words, and the input has uncertain and/or duplicate entries, so a named vector such as
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose either.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0,           # initialise the counter before substituting
           fun = function(t, x) t$v <- t$v + 1)  # increment the counter and return it for each match
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
Examples:
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
# match: hello world your make world hello pos
# context: , ?? , |[] , ,
# number: 1==== 2==== 3=== 4=== 5==== 6==== 7==
Here is a base R solution that uses only vectorized operations.
x="hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split=strsplit(x,"")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res=rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values=="a"]=1
#replace run values by increasing number
rle_res$values[rle_res$values=="a"]=1:sum(rle_res$values=="a")
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res),collapse="")
#[1] "1,2??3,4|[]5,6,7"
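Another base R possibility, offered here only as a sketch, is to combine gregexpr with the replacement form of regmatches. The helper name number_words is hypothetical, and the example assumes every input string contains at least one match.

```r
# Hypothetical helper: replace the k-th match of `pattern` in each string with k.
number_words <- function(x, pattern = "\\w+") {
  m <- gregexpr(pattern, x)
  found <- regmatches(x, m)
  # One replacement vector per input string: "1", "2", ... for its matches
  regmatches(x, m) <- lapply(found, function(w) as.character(seq_along(w)))
  x
}

x <- "hello,world??your,make|[]world,hello,pos"
number_words(x)
# [1] "1,2??3,4|[]5,6,7"
```

The separators are left untouched because only the matched spans are swapped out.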

How can you increment a gsub() replacement string?

Assume a data frame has many columns that all say “bonus”. The goal is to rename each bonus column uniquely with an appended number. Example data:
string <- c("bonus", "bonus", "bonus", "bonus")
string
[1] "bonus" "bonus" "bonus" "bonus"
Desired column name output:
[1] "bonus1" "bonus2" "bonus3" "bonus4"
Assume you don't know how many bonus columns there will be, so you cannot simply paste from 1 to that number of columns onto each bonus column name.
The following approach works but seems inelegant and seems too hard-coded:
bonus.count <- nrow(count(grep(pattern = "bonus", x = string)))
string.numbered <- paste0(string, seq(from = 1, to = bonus.count, by = 1))
How can the gsub function (or another regex-based function) substitute an incremented number? Along the lines of
string.gsub.numbered <- gsub(pattern = "bonus", replacement = "bonusincremented by one until no more bonuses", x = string)
As far as I know, gsub can't run any sort of function over each result, but using regexpr and regmatches makes this pretty easy:
string <- c("bonus", "bonus", "bonus", "bonus")
m <- regexpr("bonus",string)
regmatches(string,m) <- paste0(regmatches(string,m), 1:length(m))
string
# [1] "bonus1" "bonus2" "bonus3" "bonus4"
The nice part is that regmatches allows for assignment so it's easy to swap out the matched values.
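If the bonus columns are mixed in with other column names, one option is to number only the matching entries. The sketch below uses a made-up vector of names (nms) and assumes the columns to rename are called exactly "bonus".

```r
# Hypothetical column names: only some of them are "bonus"
nms <- c("id", "bonus", "salary", "bonus", "bonus")

# Flag exact matches, then append 1, 2, 3, ... to just those entries
is_bonus <- grepl("^bonus$", nms)
nms[is_bonus] <- paste0(nms[is_bonus], seq_len(sum(is_bonus)))
nms
# [1] "id"     "bonus1" "salary" "bonus2" "bonus3"
```

Counting with sum(is_bonus) means the number of bonus columns never has to be known in advance.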
1) Using string defined in the question we can write:
paste0(string, seq_along(string))
2) If what you really have is something like this:
string2 <- "As a bonus we got a bonus coupon."
and you want to change that to "As a bonus1 we got a bonus2 coupon." then gsubfn in the gsubfn package can do that. Below, the fun method of the p proto object will be applied to each occurrence of "bonus" with count automatically incremented. The proto object p automatically saves the state of count between matches to allow this:
library(gsubfn)
string2 <- "As a bonus we got a bonus coupon." # test data
p <- proto(fun = function(this, x) paste0(x, count))
gsubfn("bonus", p, string2)
giving:
[1] "As a bonus1 we got a bonus2 coupon."
There are additional examples in the proto vignette.

R: indexing result from a regular expression

I am trying to use the indexes that were returned from searching through a string for every instance of a character. When I use gregexpr(pattern, text),
lookfor<-"n"
string<-"ATTnGGCnATTn"
gregexpr(pattern=lookfor,text=string)
I get the following:
[[1]]
[1] 4 8 12
attr(,"match.length")
[1] 1 1 1
attr(,"useBytes")
[1] TRUE
How do I index through the first line to be able to use those locations? Thank you in advance for your help!
Addition (2) : After thinking about this for a while, I came to the conclusion that you could have simply used unlist on your original gregexpr call
> unlist(gregexpr("n", string))
# [1] 4 8 12
From your comment
I am looking for the position of each letter n
it follows that you could do any of these:
> which(strsplit(string, "")[[1]] == "n")
# [1] 4 8 12
> cumsum(nchar(strsplit(string, "n")[[1]])+1)
# [1] 4 8 12
> nc <- 1:nchar(string)
> which(substring(string, nc, nc) == "n")
# [1] 4 8 12
Addition (1) in regards to the similar strings (comment in another answer) : You could use strsplit again, and locate those values with one of the methods above
> string2 <- "ATTTGGCCATTG"
> w <- which(strsplit(string, "")[[1]] == "n")
> strsplit(string2, "")[[1]][w]
[1] "T" "C" "G"
If you want to extract all the matches, you can use the builtin function regmatches()
m <- gregexpr(regexp,string)
regmatches(string,m)
This will return a list of character vectors because string can be greater than length 1. If you're only passing one string in, you can get at the vector of matches bypassing the list with
regmatches(string,m)[[1]]
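Applied to the string from the question, the two pieces (positions and matched text) might be pulled out as follows. This is a small sketch; the variable names positions and matches are just illustrative.

```r
string <- "ATTnGGCnATTn"

# gregexpr returns a list with one element per input string
m <- gregexpr("n", string, fixed = TRUE)

# Plain integer vector of match positions (as.integer drops the attributes)
positions <- as.integer(m[[1]])
positions
# [1]  4  8 12

# The matched substrings themselves
matches <- regmatches(string, m)[[1]]
matches
# [1] "n" "n" "n"
```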
Here is a step-by-step method to find the indices. I suspect there are more efficient ways to achieve the same result. The argument fixed = TRUE tells R to look for the literal lower case "n" rather than treat it as a regular expression.
The [[1]] at the end extracts the first (and only) element of the list returned by gregexpr, which holds the indices.
To show all indices, subset from 1 to length(index); this also drops the attributes.
string <- "ATTnGGCnATTn"
index <- gregexpr(pattern = "n", text = string, fixed = TRUE)[[1]]
all.indices <- index[1:length(index)]
I think I figured out my own answer!
Using the Biostrings package in Bioconductor:
string<-"ATTnGGCnATTn"
matches<-matchPattern(pattern="n", subject=string)
m<-as.vector(ranges(matches))
Now I can call up each position of "n" in all similar strings. Thank you to those who took the time to answer.

string manipulations using R

I am working with a string vector. Each element in the vector is in the format "MMYYPub". I want to switch the places of "MM" and "YY" in the string, from "MMYYPub" to "YYMMPub". Is this feasible in R?
Example:
vec[1]
'0100pub'
vec[2]
'0200pub'
The first two digits are the month, and the following two digits are the year. There are 10 years of data in total, from 1994 to 2013.
It might also be useful to know about the yearmon class for representing monthly data. The yearmon object can then be printed in a format of your choice.
library(zoo)
ym <- as.yearmon("0414pub", format = "%m%ypub")
ym
# [1] "apr 2014"
format(ym, "%y%mpub")
# [1] "1404pub"
You need to read up on regular expressions. Here is one way:
R> val <- "0405pub"
R> gsub("(\\d\\d)(\\d\\d)(.*)", "\\2\\1\\3", val)
[1] "0504pub"
R>
We use the fact that
\d denotes a digit (but we need to escape the backslash in an R string),
(...) captures a group, so here we match group one (two digits), group two (also two digits) and group three (the remainder),
we then "simply" create the replacement string as "two before one, followed by three".
There are other ways to do it, but this will do the job given the pattern you described.
Edit: Here is a shorter variant using \\d{2} to request two digits:
R> gsub("(\\d{2})(\\d{2})", "\\2\\1", val)
[1] "0504pub"
R>
One way would be to use gsub to replace all occurrences in the vector.
> vec <- c('0100pub', '0200pub')
> gsub('([0-9]{2})([0-9]{2})', '\\2\\1', vec)
[1] "0001pub" "0002pub"
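One thing to keep in mind with the unanchored gsub pattern is that it swaps every run of four digits, not only the leading one. A slightly more defensive sketch uses sub with a ^ anchor; the input "0100pub2013" below is a made-up value used only to show the difference.

```r
vec <- c("0100pub", "0200pub", "1213pub")

# Swap only the first four characters, and only if they are digits
swapped <- sub("^(\\d{2})(\\d{2})", "\\2\\1", vec)
swapped
# [1] "0001pub" "0002pub" "1312pub"

# Hypothetical input with a second digit run: unanchored gsub swaps both runs
gsub("(\\d{2})(\\d{2})", "\\2\\1", "0100pub2013")
# [1] "0001pub1320"
sub("^(\\d{2})(\\d{2})", "\\2\\1", "0100pub2013")
# [1] "0001pub2013"
```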
You might not need regex for this, if you just want to swap the characters around. substring and paste work just as well in this case:
> f <- function(x) paste0(substring(x,3,4), substring(x,1,2), substring(x,5))
> x
[1] "0103pub" "0204pub"
> f(x)
[1] "0301pub" "0402pub"