Stop after first match found (str_match) - regex

Is there an option to stop the search after the first match is found using str_match? Something equivalent to grep's -m option? I looked in the stringr package but couldn't find anything. Perhaps I missed it?
In a given string:
str <- "This is a 12-month study cycle"
I'm using the code below to extract 12-month from it:
str_match(str, "(?i)(\\w+)[- ](month|months|week|weeks)")[1]
But if the string str extends to:
"This is a 12-month study cycle. In the 2 month period,blah blah...".
I'd like the search to stop after the first match and retrieve only 12-month, not both 12-month and 2 month. Any idea how I can do this?

How about this?
str <- "This is a 12-month study cycle"
regmatches(str, regexpr("(?i)(\\w+)[- ](month|months|week|weeks)", str) )
[1] "12-month"
str2 <- "This is a 12-month study cycle. In the 2 month period,blah blah..."
regmatches(str2, regexpr("(?i)(\\w+)[- ](month|months|week|weeks)", str2) )
[1] "12-month"
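As a side note on why this stops at the first hit: regexpr() reports only the first match in each string, while gregexpr() reports every match. A minimal base R sketch of the difference follows; the names s, pat, first and all_m are made up for illustration, the alternation is shortened to months?|weeks?, and perl = TRUE is set so the (?i) flag is handled by PCRE.

```r
# Extended example string from the question.
s <- "This is a 12-month study cycle. In the 2 month period,blah blah..."
pat <- "(?i)(\\w+)[- ](months?|weeks?)"

# regexpr() locates only the first match in each element.
first <- regmatches(s, regexpr(pat, s, perl = TRUE))

# gregexpr() locates all matches; the result is a list with one element per string.
all_m <- regmatches(s, gregexpr(pat, s, perl = TRUE))[[1]]

print(first)  # "12-month"
print(all_m)  # "12-month" "2 month"
```

So if only the first occurrence is wanted, regexpr() (or stringr's str_match, as opposed to str_match_all) should already be enough.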

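Try the stringi package. To match all occurrences, use stri_match_all_regex; for just the first or the last, use stri_match_first_regex or stri_match_last_regex.
library(stringi)
# the extended string from the question
str <- "This is a 12-month study cycle. In the 2 month period,blah blah..."
stri_match_first_regex(str, "(?i)(\\w+)[- ](month|months|week|weeks)")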
[,1] [,2] [,3]
[1,] "12-month" "12" "month"
stri_match_all_regex(str, "(?i)(\\w+)[- ](month|months|week|weeks)")
[[1]]
[,1] [,2] [,3]
[1,] "12-month" "12" "month"
[2,] "2 month" "2" "month"

Related

Capturing parts of string using regular expression in R

I have these strings:
myseq <- c("ALM_GSK_LN_06.ID","AS04_LV_06.ID.png","AS04_SP_06.IP.png")
What I want to do is to capture parts of the sequence
ALM_GSK LN ID
AS04 LV ID
AS04 SP IP
I tried this but failed:
library(stringr)
str_match(myseq, "([A-Z]+)_(LN|LV|SP)_06\\.([A-Z]+)")
Which produces:
[,1] [,2] [,3] [,4]
[1,] "GSK_LN_06.ID" "GSK" "LN" "ID"
[2,] NA NA NA NA
[3,] NA NA NA NA
What's the right way to do it?
You are pretty close. Here is a small adjustment:
str_match(myseq, "(.+)_(LN|LV|SP)_06\\.([A-Z]+)")[, -1]
produces:
[,1] [,2] [,3]
[1,] "ALM_GSK" "LN" "ID"
[2,] "AS04" "LV" "ID"
[3,] "AS04" "SP" "IP"
Yours doesn't work because your first token matches neither digits nor underscores, which you need for "AS04" (digits) and "ALM_GSK" (underscore).
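To make the difference visible without stringr, here is a rough base R sketch using regexec(), which returns the full match plus the capture groups. The variable names (strict, loose, m_strict, m_loose, res) are only for illustration.

```r
myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")

strict <- "([A-Z]+)_(LN|LV|SP)_06\\.([A-Z]+)"  # letters only in the first group
loose  <- "(.+)_(LN|LV|SP)_06\\.([A-Z]+)"      # anything in the first group

# regexec() gives the full match followed by each capture group;
# regmatches() returns character(0) for elements that do not match.
m_strict <- regmatches(myseq, regexec(strict, myseq))
m_loose  <- regmatches(myseq, regexec(loose, myseq))

print(lengths(m_strict))  # 4 0 0: only the first string matches

# Bind into a matrix and drop column 1 (the full match).
res <- do.call(rbind, m_loose)[, -1]
print(res)
```

With the strict pattern, "AS04" fails because the character right before "_LV" is a digit, and "ALM_GSK" is truncated to "GSK" because [A-Z]+ cannot cross the underscore.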
Your regular expression fails to match the whole prefix because [A-Z]+ only matches uppercase letters. To fix this, simply change the first group to a greedy wildcard such as (.+). Here is another solution:
library(gsubfn)
myseq <- c('ALM_GSK_LN_06.ID', 'AS04_LV_06.ID.png', 'AS04_SP_06.IP.png')
strapply(myseq, '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)', c, simplify = rbind)
# [,1] [,2] [,3]
# [1,] "ALM_GSK" "LN" "ID"
# [2,] "AS04" "LV" "ID"
# [3,] "AS04" "SP" "IP"
Totally stealing @hwnd's regex, but in a tidyr/dplyr approach:
library(dplyr); library(tidyr)
data_frame(myseq) %>%
extract(myseq, c('A', 'B', 'C'), '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)')
## A B C
## 1 ALM_GSK LN ID
## 2 AS04 LV ID
## 3 AS04 SP IP

php preg_match_all equivalent

I am looking for an R equivalent to PHP's preg_match_all function.
Objective:
Search a single string (not a vector of several strings) for a regexp pattern
Return a matrix of matches
Example:
Assume the following flat string without delimitation.
"This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith."
Using a regular expression pattern similar to
"Title: ([^;]*?); Last Name: ([^;.]*?)"
I would like to produce the following matrix from the above string:
     [,1] [,2]
[1,] Sir John
[2,] Mr. Smith
I have successfully accomplished this in PHP on a remote server using the preg_match_all function; however, the text files I am accessing are relatively large (not huge but slow to upload anyways). Building this in R will save a significant amount of time.
I have read up on use of grep, etc. in R but every example I have found searches for patterns in a vector and I have been unable to generate the matrix as described above.
I have also played with the stringr package but again I have not been successful generating a matrix.
This seems like a common task to me so I am sure someone smarter than me has found a solution before.
Consider the following option using regmatches :
x <- 'This is a sample string written like a paragraph. In this string two sets of information exist. Each set contains two variables. We want to extract the sets and variables within those sets. Each information set is formatted the same way. The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith.'
m <- regmatches(x, gregexpr('(?i)Title: \\K[^;]+|Last Name: \\K[^;.]+', x, perl=T))
matrix(unlist(m), ncol=2, byrow=T)
Output:
[,1] [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
For some reason there doesn't seem to be an easy way to extract captured matches in base R (I wish regmatches also worked with capture groups, but it does not). I ended up writing my own function, which you can find in regcapturedmatches.R. It works like this:
a <- "The first set is Title: Sir and Last Name: John; and the second set is Title: Mr. and Last Name: Smith."
m<-gregexpr("Title: ([^;]*) and Last Name: ([^;.]*)", a, perl=T, ignore.case=T)
regcapturedmatches(a,m)[[1]]
This will return
[,1] [,2]
[1,] "Sir" "John"
[2,] "Mr." "Smith"
(I added the [[1]] because you said you would only operate on one string at a time. The function can operate on a vector and will return results in a list. Really, in R, everything is a vector, so there is no such thing as a "single" string; you just have a vector of strings with length 1.)
Of course this method is only as good as your regular expression. I had to modify your sample data a bit so your expression would match more than one Title/Name.
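If you would rather not depend on a helper file, one possible base R workaround is a two-pass approach: find every full match with gregexpr(), then re-match each hit with regexec() to pull out its groups. This is only a sketch, not the regcapturedmatches implementation. It assumes the unedited sentence from the question and a slightly simplified pattern with greedy quantifiers; the names whole, groups and res are illustrative.

```r
# The relevant sentence from the question, unedited.
x <- "The first set is Title: Sir; Last Name: John; and the second set is Title: Mr.; Last Name: Smith."
pat <- "Title: ([^;]*); Last Name: ([^;.]*)"

# Pass 1: every full match in the single string.
whole <- regmatches(x, gregexpr(pat, x))[[1]]

# Pass 2: re-match each hit on its own to get the capture groups.
groups <- regmatches(whole, regexec(pat, whole))

# One row per match; drop column 1, which holds the full match.
res <- do.call(rbind, groups)[, -1, drop = FALSE]
print(res)
```

Re-matching the extracted substrings is safe here because the pattern has no anchors or lookarounds; with those, the second pass could behave differently from the first.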
Here is a stringr version:
library(stringr)
str_match_all(x, pattern)
Produces:
[[1]]
[,1] [,2] [,3]
[1,] "Title: Sir and Last Name: John" "Sir" "John"
[2,] "Title: Mr. and Last Name: Smith" "Mr." "Smith"
Note that I had to edit your text so that the second one is also of form "and Last Name:". To get your matrix you can just do:
result[[1]][, -1] # assumes the above is in `result`; drops the full-match column
One limitation of this is it uses regexec, which doesn't support perl regular expressions.

how to use regexp to return small part of character based on pattern

This should be easy for anyone who understands regular expressions, which I'm struggling to do.
I have a vector of strings that looks like
strings<-c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
"PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
"08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")
I want to convert this vector into a data frame where each row comes from an element of the original vector, with 3 columns where each column comes from one part of the text in parentheses. So the result here would be
col1 col2 col3
"LN" "WEC" "WPS"
"LN" "MP" "MP"
"XF" "CIN" "*"
If there is more than one instance of the pattern in a string, then it should take the first instance.
I think my main problem is that ( is a special character. I'm trying to escape it as \(, but I get an error that \( is an unrecognized escape, so I'm a little lost.
Sounds like you're forgetting to escape the \ in \(, i.e. \\(:
do.call(rbind, strsplit(sub('.*?\\((.*?)\\).*', '\\1', strings), split = "/"))
[,1] [,2] [,3]
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP" "MP"
[3,] "XF" "CIN" "*"
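To spell out the escaping and get all the way to the data frame the question asks for, here is a small sketch along the same lines. The column names col1 to col3 come from the question; the intermediate names inner, mat and df are made up, and perl = TRUE is an assumption to make the lazy quantifiers explicit PCRE.

```r
strings <- c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
             "PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
             "08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")

# In R source, "\\(" is a two-character string: a backslash and a paren.
# The regex engine then reads that as an escaped, literal "(".
print(nchar("\\("))  # 2

# Keep only the contents of the first parenthesised group in each string.
inner <- sub(".*?\\((.*?)\\).*", "\\1", strings, perl = TRUE)

# Split on the literal slash and bind the pieces into rows.
mat <- do.call(rbind, strsplit(inner, "/", fixed = TRUE))

df <- setNames(as.data.frame(mat, stringsAsFactors = FALSE),
               c("col1", "col2", "col3"))
print(df)
```

Because both quantifiers before the closing paren are lazy, the first string yields its first (LN/WEC/WPS) group rather than the last one.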
1) We define a pattern that matches
left-paren non-slashes slash non-slashes slash non-right-parens remainder
which correspond to the following respectively:
\\( ([^/]+) / ([^/]+) / ([^)]+) .*
Now extract the parenthesized portions using strapplyc and simplify into a matrix. The code is:
library(gsubfn)
pat <- "\\(([^/]+)/([^/]+)/([^)]+).*"
strapplyc(strings, pat, simplify = rbind)
giving:
[,1] [,2] [,3]
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP" "MP"
[3,] "XF" "CIN" "*"
2) This alternative uses strapplyc nested in strapply. The regular expressions are slightly simpler and it's still basically one line of code, but that line is longer. The first regex picks out everything between the first set of parens, and the second extracts the slash-separated fields:
strapply(strings, "\\(([^)]+).*", ~ strapplyc(x, "[^/]+")[[1]], simplify = rbind)
REVISED Some improvements to first solution plus a variation as second solution.

Split column label by number of letters/characters in R

I have a large dataset where all column headers are individual IDs, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique way to separate the column name (ie: there's a comma separating the 2 components, or a period).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character, and {7} means 7 times. Hence, the regex matches a position that is preceded by 7 characters; in an 8-character ID, the first such position is between the 7th and the 8th character.
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A" "B" "A" "B" "A" "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicated by surrounding with parentheses) the single character (.) that follows zero or more other characters (.*), and replace the match with the captured content (\\1)". For more info, see ?gsub.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)
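Putting the two halves together, a short sketch of how the substr approach could produce the two-row layout from the question. It relies on substr() being vectorised over its start and stop arguments; the row labels ID1 and ID2 are taken from the question, while n and res are illustrative names.

```r
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")

# Length of each ID, so this also works if the IDs are not all 8 characters.
n <- nchar(ID1)

# Row ID1: everything but the last character. Row ID2: the last character.
res <- rbind(ID1 = substr(ID1, 1, n - 1),
             ID2 = substr(ID1, n, n))
print(res)
```

Use cbind() instead of rbind() if you prefer one row per ID and two columns.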

How to measure similarity between strings?

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check whether two strings in a vector are similar.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?
This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers to these questions:
agrep: only return best match(es)
In R, how do I replace a string that contains a certain pattern with another string?
Fast Levenshtein distance in R?
But most often agrep will do what you want :
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4
Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
lapply(pres, agrep, pres, value = TRUE)
[[1]]
[1] " Obama, B." "Obama, B.H."
[[2]]
[1] "Bush, G.W."
[[3]]
[1] " Obama, B." "Obama, B.H."
[[4]]
[1] "Clinton, W.J."
Add another duplicate to show it works with more than one duplicate.
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")
adist shows the string distance between 2 character vectors
adist(" Obama, B.", pres)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 9 3 10 7
For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:
d <- adist(" Obama, B.", pres)
pres[which(d == min(d[d > 0]))] # index by the position of the minimum, not by its value
# [1] "Obama, B.H."
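The same idea can be extended to every name at once. Calling adist() with a single vector returns the full pairwise distance matrix, so a rough sketch for "closest other name" could look like this (d_all and closest are illustrative names, and ties are resolved by taking the first minimum):

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.", "Bush, G.")

# Pairwise edit distances between all names (5 x 5 matrix).
d_all <- utils::adist(pres)

# Ignore the zero distance of each name to itself.
diag(d_all) <- NA

# For each row, the position of the smallest remaining distance.
closest <- pres[apply(d_all, 1, which.min)]

print(data.frame(name = pres, closest = closest))
```

Note that every name gets a "closest" partner even when nothing is really similar (for example "Clinton, W.J."), so in practice you would probably combine this with a distance threshold.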
To obtain unique names, taking into account spelling errors and inconsistencies, you can compare each string to all previous ones. Then if there is a similar one, remove it. I created a keepunique() function that performs this. keepunique() is then applied to all elements of the vector successively with Reduce().
keepunique <- function(previousones, x){
  if(any(adist(x, previousones)<5)){
    x <- NULL
  }
  return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B." "Bush, G.W." "Clinton, W.J."