Combine regex 'or' with stop at first occurrence - regex

Conceptually, I want to search for (a|b) and get only the first occurrence. I know this is a lazy/non-greedy application, but can't seem to combine it properly with the or.
Moving beyond the conceptual level, which might change things a lot, a and b are actually longer patterns, but they have been tested separately and work fine. And I'm using this in strapply from package gsubfn which intrinsically finds all matches.
I suspect the answer is here in SO somewhere, but it's hard to search on such things.
Details: I'm trying to find function expressions var functionName = function(...) and function declarations function functionName(...) and extract the name of the function in javascript (parsing the lines with R). a is \\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i] and b is \\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]. They work fine individually. A single function definition will take one form or the other, so I need to stop searching when I find one.
EDIT: In this string Here is a string of blah blah blah I'd like to find only the first 'a' using (a|b) or the first 'b' only using (b|a), plus of course whatever regex goodies I am missing.
EDIT 2: A big thanks to all who have looked at this. The details turn out to be important, so I'm going to post more info. Here are the test lines I am searching:
dput(lines)
c("var activateBrush = function() {", " function brushed() { // Handles the response to brushing",
" var followMouse = function(mX, mY) { // This draws the guides, nothing else",
".x(function(d) { return xContour(d.x); })", ".x(function(i) { return xContour(d.x); })"
)
Here are the two patterns I want to use, and how I use them individually.
fnPat1 <- "\\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat1, replacement = paste0, X = lines))
fnPat2 <- "\\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat2, replacement = paste0, X = lines))
They return, in order:
[1] "brushed" "brushed"
[1] "activateBrush" "followMouse" "activateBrush" "followMouse"
What I want to do is use both of these patterns at the same time. What I tried was
fnPat3 <- paste("((", fnPat1, ")|(", fnPat2, "))") # which is (a|b) of the orig. question
But that returns
[1] " activateBrush = function() " " function brushed() "
What I want is a vector of all the function names, namely c("brushed", "activateBrush", "followMouse") Duplicates are fine, I can call unique.
Maybe this is clearer now, maybe someone sees an entirely different approach. Thanks everyone!

To match the first a or b,
> x <- "Here is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "a"
> x <- "Here b is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "b"
You can check whether the regex matches the first a or b with sub. Below, sub replaces the first a or b with ***. This takes advantage of the fact that sub does not do a global replacement: it replaces only the first occurrence of the pattern.
> x <- "Here is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here *** is a string of blah blah blah"
We could use gregexpr or gsub functions also.
> x <- "Here is a string of blah blah blah"
> m <- gregexpr("^[^ab]*\\K[ab]", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here *** is a string of blah blah blah"
Explanation:
^ Asserts that we are at the start.
[^ab]* is a negated character class that matches any character other than a or b, zero or more times. We don't use [^ab]+ because a or b might be the very first character of the line.
\K discards the characters matched so far, i.e. whatever [^ab]* matched is dropped from the reported match.
[ab] Now it matches the following a or b
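As a complement to the explanation above, here is a small sketch of the same idea applied to a whole character vector. regexpr() reports only the first match in each element, so a plain alternation needs no extra laziness. The input vector and the name `first` are made up for illustration.

```r
# Illustrative input (not from the question): the third element has no 'a' or 'b'
x <- c("Here is a string of blah", "Here b is a string", "zzz")

# regexpr() returns the position of the first match only (-1 if there is none)
m <- regexpr("a|b", x)

# regmatches() drops non-matching elements, so fill those slots with NA
first <- rep(NA_character_, length(x))
first[m > 0] <- regmatches(x, m)
first
# [1] "a" "b" NA
```

Keeping the result the same length as the input makes it easy to line matches up with the original strings.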

It seems to me this would be a lot easier by combining the expressions ...
strapply(lines, '(?:var|function)\\s*([[:alnum:]]+)', simplify = c)
# [1] "activateBrush" "brushed" "followMouse"
(?: ... ) is a non-capturing group. Placing ?: at the start of a group means it is used only for grouping and is not captured. The pattern says: group but do not capture "var" or "function", then capture the alphanumeric characters that follow.
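For readers without gsubfn, roughly the same extraction can be sketched in base R with regexec(), which returns the first match per line together with its capture groups. This is an illustrative variant, not the answerer's code: the pattern uses an ordinary capturing group for the keyword and requires at least one space after it, and the names pat, caps and fnNames are assumptions.

```r
lines <- c("var activateBrush = function() {",
           " function brushed() { // Handles the response to brushing",
           " var followMouse = function(mX, mY) { // This draws the guides, nothing else",
           ".x(function(d) { return xContour(d.x); })",
           ".x(function(i) { return xContour(d.x); })")

# Group 1 is the keyword, group 2 is the name that follows it.
# Requiring at least one space keeps "function(d)" from matching.
pat <- "(var|function)[[:space:]]+([[:alnum:]]+)"

# regexec() finds the first match per line and keeps the capture groups
caps <- regmatches(lines, regexec(pat, lines))

# Each hit is c(full match, keyword, name); lines without a hit give character(0)
fnNames <- unlist(lapply(caps, function(z) if (length(z) == 3) z[3]))
fnNames
# [1] "activateBrush" "brushed" "followMouse"
```

Because regexec() stops at the first match, a line can contribute at most one name, which is the "stop at first occurrence" behaviour the question asks for.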

Try str_extract() from stringr package.
str_extract("b a", "a|b")
[1] "b"
str_extract("a b", "a|b")
[1] "a"
str_extract(c("a b", "b a"), "a|b")
[1] "a" "b"

Related

Extracting hashtags AND attached string elements (IF ANY) with regular expressions AND positive lookarounds and lookbehinds in r

I'd like to create a function in R using regular expressions that extracts hashtags (and one for #'s as well) but checks whether each one is part of a longer string and returns those parts of the string. I'm still picking up hashtags (and #'s), so I assume the ones I'm picking up are not pure hashtag strings (#word), because this is after using a function to remove URLs, emails, hashtags, and #'s via:
clean.text <- function(x){
x <- gsub("http[^[:space:]]+"," ", x)
x <- gsub("([_+A-Za-z0-9-]+(\\.[_+A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,14}))","", x)
x <- gsub("\\s#[[:alnum:]_]+"," ", x)
x <- gsub("\\s#[^[:space:]]+"," ", x)
x
}
So I'd like to know what parts of the string are attached to the hashtags (and #'s), because I'm still getting hashtags (and #'s) when I use the following on my cleaned text.
findHash2 <- function(x){
m <- gregexpr("(#\\w+)", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
findAT2 <- function(x){
m <- gregexpr("#(\\w+)", x, perl=TRUE)
w <- unlist(regmatches(x,m))
op <- paste(w,collapse=" ")
return(op)
}
Note: again, this is after I apply my clean.text function to my text. Would it be something like this?
findHash1 <- function(x){
m <- gregexpr("(?<=^)#\\w+(?=$)", x, perl=TRUE)
w <- unlist(regmatches(x, m))
return(paste(w, collapse=" "))
}
UPDATE Example
x <- "yp#MonicaSarkar: RT #saultracey: Sun kissed .....#olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl"
x <- "I don'nt#know #It would %%%%#be #best if#you just.idk#provided/a fewexample#character! strings# #my#&^( 160,000+posts#in #of text #my) #data is#so huge!# (some# that #should match#and some that #shouldn't) and post# the desired#output.#We'll take it from there."
As for the desired output, I guess something like:
[1] yp#MonicaSarkar: #saultracey: .....#olmpicrings
Or in the second example:
[1] don'nt#know if#you 160,000+posts#in %%%%#be fewexample#character!
Ultimately, I'd like to see what's attached to the hash tags.
I'd like a function (or functions) that extracts a hashtag when it is part of a string (with another function or set of functions for #'s), in three scenarios; my attempt above is aimed at the first. One would match only if the hashtag is preceded and followed by one or more characters, another only if it is followed by one or more characters, and a third only if it is preceded by one or more characters. That is: one that matches the hashtag only in the middle of a string (not at the start or the end), one that matches only at the start, and one that matches only at the end.
Would three functions like I discussed need to be created for that type of procedure or could it be combined into one?
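One way to approach the three scenarios, sketched here under assumptions rather than as a definitive answer: split the text on whitespace and classify each token by where the # sits. Treating whitespace-delimited tokens as the unit of "attached" text is an assumption about what is wanted, and the names tokens, before, after, middle, start_only and end_only are made up for illustration.

```r
x <- "yp#MonicaSarkar: RT #saultracey: Sun kissed .....#olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl"

# Work token by token so "attached" means "in the same whitespace-delimited chunk"
tokens <- strsplit(x, "[[:space:]]+")[[1]]

before <- grepl("[^#]#", tokens)  # some non-# character directly precedes a '#'
after  <- grepl("#[^#]", tokens)  # some non-# character directly follows a '#'

middle     <- tokens[before & after]   # text on both sides of a '#'
start_only <- tokens[!before & after]  # '#' leads the token
end_only   <- tokens[before & !after]  # '#' trails the token

middle
# [1] "yp#MonicaSarkar:" ".....#olmpicrings"
start_only
# [1] "#saultracey:" "#towerbridge" "#london2012"
end_only
# character(0)
```

With this layout a single pass yields all three groups, so one function returning a list may be enough instead of three separate ones; a lone "#" falls into none of the groups.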

Regular expression to find and replace conditionally

I need to replace string A with string B, only when string A is a whole word (e.g. "MECH"), and I don't want to make the replacement when A is a part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() which checks if string A is a whole word, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea of making the gsub() replacement when grepl() returns TRUE, and otherwise not replacing. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern)){
reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
ifelse(grepl(reg, aa$type),
x <- gsub(pattern[i], replacement[i], x, ...),
aa$type)
}
x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in 1:length(from)){
x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
I use the regex (?<=\W|^)MECH(?=\W|$) to check whether the string contains the whole word MECH. In R, lookarounds like these need perl=TRUE.
Is that what you need?
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.
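To see why the choice of boundary matters, the sketch below compares a two-sided word boundary (\\b with perl = TRUE, which behaves like the \\< \\> pair for these inputs) with the trailing-only (\\s|$) test. The input vector is made up for illustration. Note that (\\s|$) checks only what follows MECH, so it also rewrites a word that merely ends in MECH.

```r
# Illustrative input: "BIOMECH" ends in MECH but is not the whole word MECH
x <- c("MECH CONSTR", "MECHANICAL", "BIOMECH UNIT", "MECHCONSTRUCTION")

# Two-sided boundary: MECH must be a word on its own
both_sides <- gsub("\\bMECH\\b", "MECHANICAL", x, perl = TRUE)
both_sides
# [1] "MECHANICAL CONSTR" "MECHANICAL" "BIOMECH UNIT" "MECHCONSTRUCTION"

# Trailing-only test: MECH must be followed by whitespace or the end of the string
trailing_only <- gsub("MECH(\\s|$)", "MECHANICAL\\1", x)
trailing_only
# [1] "MECHANICAL CONSTR" "MECHANICAL" "BIOMECHANICAL UNIT" "MECHCONSTRUCTION"
```

If a leading check is also wanted with the explicit style, the pattern can be extended to capture (^|\\s) in front and carry it over with a second backreference.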

regex match within parenthesis

I'm attempting to make some regular expressions that I wrote for Python also work with R.
Here is what I have in Python (using the excellent re module), with my expected 3 matches:
import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']
Now with R, here is my best attempt:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\"" "\"Second [L]\"" "\"Third [1/T]\""
Why did R match the whole pattern, rather than just the part within the parentheses? I was expecting:
[1] "First [T]" "Second [L]" "Third [1/T]"
Furthermore, perl=TRUE didn't make any difference. Is it safe to assume that R's regex does not support returning only the parenthesized part, or is there some trick that I'm missing?
Summary of solution: thanks @flodel, it appears to work well with other patterns too, so it appears to be a good general solution. A compact form of the solution using an input string line and regular expression pattern pat is:
pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])
Furthermore, perl=TRUE should be added to gregexpr if using PCRE features in pat.
If you print m, you'll see that gregexpr(..., perl = TRUE) gives you the positions and lengths of matches for a) your full pattern, including the opening and closing quotes, and b) the captured (.*?).
Unfortunately for you, when m is used by regmatches, it uses the positions and lengths of the former.
There are two solutions I can think of.
Pass your final output through sub:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)
Or use substring using the positions and lengths of the captured expressions:
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)
To further your understanding, see what happens if your pattern is trying to capture more than one thing. Also see that you can give names to your capture groups (what the doc refers to as Python-style named captures), here "capture1" and "capture2":
m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos[, "capture1"],
end.pos[, "capture1"])
# [1] "First" "Second" "Third"
substring(line, start.pos[, "capture2"],
end.pos[, "capture2"])
# [1] "T" "L" "1/T"
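The capture.start / capture.length bookkeeping above can be wrapped in a small helper. This is a minimal sketch assuming a single input string and a pattern with at least one capture group; the function name capture_all is hypothetical and it returns only the first group.

```r
# Hypothetical helper: return the text of the first capture group for every match
capture_all <- function(pat, x) {
  m <- gregexpr(pat, x, perl = TRUE)[[1]]   # perl = TRUE is what exposes capture.* attributes
  if (m[1] == -1) return(character(0))      # no match at all
  st  <- attr(m, "capture.start")[, 1]      # start of group 1 in each match
  len <- attr(m, "capture.length")[, 1]     # length of group 1 in each match
  substring(x, st, st + len - 1L)
}

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
res <- capture_all('"(.*?)"', line)
res
# [1] "First [T]" "Second [L]" "Third [1/T]"
```

Indexing the attribute matrices by column name instead of 1 would give the same result for a named group.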
1) strapplyc in the gsubfn package acts in the way you were expecting:
> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
2) Although it involves delving into m's attributes, it's possible to make regmatches work by reconstructing m to refer to the captures rather than the whole match:
at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )
regmatches( line, m2 )[[1]]
3) If we knew that the strings always ended in ] and were willing to modify the regular expression then this would work:
> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"

Look for specific character in string and place it at different positions after a defined separator in the same string

Let's define the following string s:
s <- "$ A; B; C;"
I need to translate s into:
"$ A; $B; $C;"
The semicolon is the separator. However, $ is only one of 3 special characters which can appear in the string. The data frame m holds all 3 special characters:
m <- data.frame(sp = c("$", "%", "&"))
I first used strsplit to split the string using the semicolon as the separator
> strsplit(s, ";")
[[1]]
[1] "$ A" " B" " C"
I think the next step would be to use grep or match to check if the first string contains any of the 3 special characters defined in data frame m. If so, maybe use gsub to insert the matched special character into the remaining substrings. Then simply use paste with collapse = "" to merge the substrings together again. Does that make sense?
Cheers
What about something like this:
getmeout = gsub("[$|%|& ]", "", unlist(strsplit(s, ";")))
whatspecial = unique(gsub("[^$|%|&]", "", s))
whatspecial
# [1] "$"
getmeout
# [1] "A" "B" "C"
# paste0 has no sep argument, so pass ";" as a third piece to append
paste0(whatspecial, getmeout, ";", collapse="")
# [1] "$A;$B;$C;"
Here is one method:
library(stringr)
separator <- '; '
# extract the first part
first.part <- str_split(s, separator)[[1]][1]
first.part
# [1] "$ A"
# try to identify your special character
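# fixed() matters here: as a regex, "$" is an end-of-string anchor and would match every string
special <- m$sp[str_detect(first.part, fixed(as.character(m$sp)))]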
special
# [1] $
# Levels: $ & %
# make sure you only matched one of them
stopifnot(length(special) == 1)
# search and replace
gsub(separator, paste(separator, special, sep=""), s)
# [1] "$ A; $B; $C;"
Let me know if I missed some of your assumptions.
Back-referencing turns it into a one-liner:
s <- c( "$ A; B; C;", "& A; B; C;", "% A; B; C;" )
ms = c("$", "%", "&")
s <- gsub( paste0("([", paste(ms,collapse="") ,"]) ([A-Z]); ([A-Z]); ([A-Z]);") , "\\1 \\2; \\1 \\3; \\1 \\4" , s)
> s
[1] "$ A; $ B; $ C" "& A; & B; & C" "% A; % B; % C"
You can then make the regular expression appropriately generic (match more than one space, more than one alphanumeric character, etc.) if you need to.
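One possible generalisation, sketched under the assumption that the special character is always the first character of the string: read that character, then insert it after every "; " that is followed by more text. The function name prefix_fields and its specials argument are hypothetical.

```r
# Hypothetical helper: copy a leading special character in front of each later field
prefix_fields <- function(s, specials = c("$", "%", "&")) {
  ch <- substr(s, 1, 1)       # candidate special character
  ok <- ch %in% specials      # only touch strings that really start with one
  s[ok] <- vapply(which(ok), function(i) {
    # "; " followed by a non-space character marks the start of a later field
    gsub("; (?=\\S)", paste0("; ", ch[i]), s[i], perl = TRUE)
  }, character(1))
  s
}

res <- prefix_fields(c("$ A; B; C;", "& A; B; C;", "A; B;"))
res
# [1] "$ A; $B; $C;" "& A; &B; &C;" "A; B;"
```

The lookahead keeps the trailing ";" untouched, and any number of fields of any length is handled without spelling each one out in the pattern.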

R: Capitalizing everything after a certain character

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")
Should come out like this:
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.
You were very close:
gsub("(_.*)","\\U\\1",x,perl=TRUE)
seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...
To take this apart a bit more:
_.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
\\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
\\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))
For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
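As a further illustration of the case-conversion escapes described in ?gsub, the sketch below uses \\L and \\E alongside \\U. These variations are not part of the original question; they only show how the escapes switch conversion on and off within one replacement string.

```r
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f")

# \\L lower-cases group 1 (before the first underscore),
# then \\U switches to upper case for group 2 (the underscore and the rest)
lower_upper <- sub("^([^_]*)(_.*)$", "\\L\\1\\U\\2", x, perl = TRUE)
lower_upper
# [1] "nyc_23DF" "bos_3_RB" "mgh_3_3_F"

# \\E ends the conversion, so group 2 is inserted unchanged
upper_head <- sub("^([^_]*)(_.*)$", "\\U\\1\\E\\2", x, perl = TRUE)
upper_head
# [1] "NYC_23df" "BOS_3_rb" "MGH_3_3_f"
```

Strings without an underscore do not match the anchored pattern and are returned unchanged.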
gsubfn in the gsubfn package is like gsub except the replacement string can be a function. Here we match _ and everything afterwards feeding the match through toupper :
library(gsubfn)
gsubfn("_.*", toupper, x)
## [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F"
Note that this approach involves a particularly simple regular expression.
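A similar result can also be sketched in base R without any replacement escapes, by locating the first underscore with regexpr and upper-casing the tail. This is an illustrative alternative rather than one of the posted answers; the names pos, hit and out are made up.

```r
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")

# Position of the first underscore in each element (-1 when there is none)
pos <- regexpr("_", x, fixed = TRUE)
hit <- pos > 0

# Keep the head as-is and upper-case everything from the underscore onward
out <- x
out[hit] <- paste0(substr(x[hit], 1, pos[hit] - 1),
                   toupper(substr(x[hit], pos[hit], nchar(x[hit]))))
out
# [1] "NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
```

Everything here is vectorised, so no sapply loop over the elements is needed.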
Simple example using base::strsplit
x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f", "a")
myCap <- function(x) {
out <- sapply(x, function(y) {
temp <- unlist(strsplit(y, "_"))
out <- temp[1]
if (length(temp[-1])) {
out <- paste(temp[1], paste(toupper(temp[-1]),
collapse="_"), sep="_")
}
return(out)
})
out
}
> myCap(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"
Example using the stringr package
pkg <- "stringr"
if (!require(pkg, character.only=TRUE)) {
install.packages(pkg)
require(pkg, character.only=TRUE)
}
myCap.2 <- function(x) {
out <- sapply(x, function(y) {
idx <- str_locate(y, "_")
if (!all(is.na(idx[1,]))) {
str_sub(y, idx[,1], nchar(y)) <- toupper(str_sub(y, idx[,1], nchar(y)))
}
return(y)
})
out
}
> myCap.2(x)
NYC_23df BOS_3_rb mgh_3_3_f a
"NYC_23DF" "BOS_3_RB" "mgh_3_3_F" "a"