Regular Expression Select Comma But Not In Between Parentheses - regex

I'm looking to create a function in R that loads the defaults of a given function. To do this, I'm calling args on the function and trying to break its output down into the function's defaulted arguments, then load those into the global environment. This takes a bit of regular expression work, and I've bumped into a problem that I'm having difficulty addressing.
Here is a sample function:
myFunc <- function(a = 1, b = "hello world", c = c("Hello", "World")) {}
I've gotten it down to this point using my own functions:
x <- "a = 1, b = \"hello world\", c = c(\"Hello\", \"World\")"
However, where I am struggling is on splitting the function arguments up. I wanted to split on a comma, but if a function argument has a comma within its default (like the c argument does), then that causes issues. What I'm thinking is that if there is a regular expression that matches a comma, but not a comma that is in between two parentheses, then I could use strsplit with that expression to get what I want.
My attempt to match the case of a comma between two parentheses looks like this:
\\(.*,.*\\)
Now, I've looked into how to do what I described above and it seems like a negative look ahead may be what I need, so I've attempted to do something like this.
splitx <- strsplit(x, "(?!\\(.*,.*\\)(,)")
But R tells me it is an illegal regular expression. If I set perl = TRUE in the argument, it just returns the same string. Any help here would be greatly appreciated and I hope I've been clear!

I'm going to try and answer your underlying question.
The function formals() returns a pairlist of the formal arguments of a function. You can filter the result of formals() by testing each element with is.symbol() and is.null(). Anything that is neither a symbol nor NULL contains a default value.
For example:
get_default_args <- function(fun){
x <- formals(fun)
w <- sapply(x, function(x)!is.symbol(x) && !is.null(x))
x[w]
}
Try it on lm():
get_default_args(lm)
$method
[1] "qr"
$model
[1] TRUE
$x
[1] FALSE
$y
[1] FALSE
$qr
[1] TRUE
$singular.ok
[1] TRUE
Try it on your function:
myFunc <- function(a = 1, b = "hello world", c = c("Hello", "World")) {}
get_default_args(myFunc)
$a
[1] 1
$b
[1] "hello world"
$c
c("Hello", "World")
Note that the comments suggest using match.call(). This may or may not work for you, but match.call() works on the arguments of an actual call, inside the function after it has been called, whereas formals() inspects the function object itself. Therefore you don't need to call the function at all when using formals().
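To connect this back to the original goal of loading the defaults into an environment, here is a minimal sketch. The helper name load_defaults and its envir parameter are made up for illustration, and it assumes that no default refers to another argument of the same function (such a default would fail to evaluate here).

```r
# Same helper as above, written with vapply for a guaranteed logical result.
get_default_args <- function(fun) {
  args <- as.list(formals(fun))
  has_default <- vapply(args, function(a) !is.symbol(a) && !is.null(a), logical(1))
  args[has_default]
}

# Hypothetical helper: evaluate each default and assign it into `envir`.
load_defaults <- function(fun, envir = globalenv()) {
  defaults <- lapply(get_default_args(fun), eval, envir = envir)
  list2env(defaults, envir = envir)
}

myFunc <- function(a = 1, b = "hello world", c = c("Hello", "World")) {}

# A fresh environment is used here so the global environment stays untouched.
e <- load_defaults(myFunc, envir = new.env())
e$c
#> [1] "Hello" "World"
```

Calling load_defaults(myFunc) without envir would write a, b and c into the global environment, which is what the question asks for but will silently overwrite existing objects with those names.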

While I don't think this is the right approach (use match.call() to extract arguments as they were passed), a matching regex is
x <- "a = 1, b = \"hello world\", c = c(\"Hello\", \"World\")"
strsplit(x, ",(?![^()]*\\))", perl=TRUE)
#> [[1]]
#> [1] "a = 1" " b = \"hello world\"" " c = c(\"Hello\", \"World\")"
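The lookahead (?![^()]*\\)) rejects a comma when the text after it reaches a closing parenthesis before any other parenthesis, which is how it recognises "inside c(...)". A small sketch of that behaviour and of its limits follows; split_top_level is a made-up name, and trimws is only there to drop the leading spaces.

```r
# Hypothetical wrapper around the strsplit call above.
split_top_level <- function(s) {
  trimws(strsplit(s, ",(?![^()]*\\))", perl = TRUE)[[1]])
}

x <- "a = 1, b = \"hello world\", c = c(\"Hello\", \"World\")"
parts <- split_top_level(x)
length(parts)
#> [1] 3

# Limitation: with nested parentheses the comma after `1` is followed by "(",
# not ")", so it is wrongly treated as a top-level separator.
nested <- split_top_level("a = list(1, c(2, 3)), b = 2")
length(nested)
#> [1] 3
```

The same applies to a comma inside a quoted default such as b = "hello, world", which this pattern would also split on. That is one more reason to prefer formals() over parsing the argument string.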

Related

Sequentially replace multiple places matching single pattern in a string with different replacements

Using stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
hello,world??your,make|[]world,hello,pos
with different replacements, e.g. increasing numbers
1,2??3,4|[]5,6,7
Note that simple separators cannot be assumed, the practical use case is more complicated.
stringr::str_replace_all does not seem to work because it
str_replace_all(x, "(\\w+)", 1:7)
produces a vector for each replacement applied to all words, or it has
uncertain and/or duplicate input entries so that
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0, # initialize the counter v to 0 before substituting
           fun = function(t, x) t$v <- v + 1) # increment v and use it as the replacement
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
Examples:
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
# match: hello world your make world hello pos
# context: , ?? , |[] , ,
# number: 1==== 2==== 3=== 4=== 5==== 6==== 7==
Here is a base R solution. It should also be vectorized.
x="hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split=strsplit(x,"")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res=rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values=="a"]=1
#replace run values by increasing number
rle_res$values[rle_res$values=="a"]=1:sum(rle_res$values=="a")
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res),collapse="")
#[1] "1,2??3,4|[]5,6,7"
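A shorter base R variant is sketched below, using the assignment form of regmatches on a gregexpr result. The function name number_matches is made up for illustration; the counter restarts for each element of x.

```r
# Hypothetical helper: replace the k-th match in each string with the number k.
number_matches <- function(x, pattern = "\\w+") {
  m <- gregexpr(pattern, x)
  # One replacement vector per input string: "1", "2", ... for each match found.
  regmatches(x, m) <- lapply(m, function(mi) as.character(seq_len(sum(mi > 0))))
  x
}

number_matches("hello,world??your,make|[]world,hello,pos")
#> [1] "1,2??3,4|[]5,6,7"
```

Any other replacement scheme fits the same shape: build, for each string, a character vector with one entry per match and assign it through regmatches.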

Combine regex 'or' with stop at first occurrence

Conceptually, I want to search for (a|b) and get only the first occurrence. I know this is a lazy/non-greedy application, but can't seem to combine it properly with the or.
Moving beyond the conceptual level, which might change things a lot, a and b are actually longer patterns, but they have been tested separately and work fine. And I'm using this in strapply from package gsubfn which intrinsically finds all matches.
I suspect the answer is here in SO somewhere, but it's hard to search on such things.
Details: I'm trying to find function expressions var functionName = function(...) and function declarations function functionName(...) and extract the name of the function in javascript (parsing the lines with R). a is \\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i] and b is \\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]. They work fine individually. A single function definition will take one form or the other, so I need to stop searching when I find one.
EDIT: In the string "Here is a string of blah blah blah" I'd like to find only the first 'a' using (a|b) or only the first 'b' using (b|a), plus of course whatever regex goodies I am missing.
EDIT 2: A big thanks to all who have looked at this. The details turn out to be important, so I'm going to post more info. Here are the test lines I am searching:
dput(lines)
c("var activateBrush = function() {", " function brushed() { // Handles the response to brushing",
" var followMouse = function(mX, mY) { // This draws the guides, nothing else",
".x(function(d) { return xContour(d.x); })", ".x(function(i) { return xContour(d.x); })"
)
Here are the two patterns I want to use, and how I use them individually.
fnPat1 <- "\\s*function\\s*([[:alnum:]]+)\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat1, replacement = paste0, X = lines))
fnPat2 <- "\\s*([[:alnum:]]*)\\s*=*\\s*function\\s*\\([^d|i]" # conveniently drops 'var'
fnNames <- unlist(strapply(pattern = fnPat2, replacement = paste0, X = lines))
They return, in order:
[1] "brushed" "brushed"
[1] "activateBrush" "followMouse" "activateBrush" "followMouse"
What I want to do is use both of these patterns at the same time. What I tried was
fnPat3 <- paste("((", fnPat1, ")|(", fnPat2, "))") # which is (a|b) of the orig. question
But that returns
[1] " activateBrush = function() " " function brushed() "
What I want is a vector of all the function names, namely c("brushed", "activateBrush", "followMouse") Duplicates are fine, I can call unique.
Maybe this is clearer now, maybe someone sees an entirely different approach. Thanks everyone!
To match the first a or b,
> x <- "Here is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "a"
> x <- "Here b is a string of blah blah blah"
> m <- regexpr("[ab]", x)
> regmatches(x, m)
[1] "b"
You can check with the sub function whether the regex matches the first a or b. Below, sub replaces the first a or b with ***. This takes advantage of the fact that sub does not do a global replacement: it replaces only the first occurrence of the characters that match the given pattern.
> x <- "Here is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> sub("[ab]", "***", x)
[1] "Here *** is a string of blah blah blah"
We could use gregexpr or gsub functions also.
> x <- "Here is a string of blah blah blah"
> m <- gregexpr("^[^ab]*\\K[ab]", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "a"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here is *** string of blah blah blah"
> x <- "Here b is a string of blah blah blah"
> gsub("^[^ab]*\\K[ab]", "***", x, perl=TRUE)
[1] "Here *** is a string of blah blah blah"
Explanation:
^ asserts that we are at the start of the string.
[^ab]* is a negated character class that matches any character other than a or b, zero or more times. We don't use [^ab]+ because a or b might be present at the very start of the line.
\K discards the previously matched characters, i.e. it drops everything matched by [^ab]* from the reported match.
[ab] then matches the following a or b.
It seems to me this would be a lot easier by combining the expressions ...
strapply(lines, '(?:var|function)\\s*([[:alnum:]]+)', simplify = c)
# [1] "activateBrush" "brushed" "followMouse"
(?: ... ) is a non-capturing group. Placing ?: at the start of the group says that it is used only for grouping and is not captured. In other words: group but do not capture "var" or "function", then capture the word characters that follow.
Try str_extract() from stringr package.
str_extract("b a", "a|b")
[1] "b"
str_extract("a b", "a|b")
[1] "a"
str_extract(c("a b", "b a"), "a|b")
[1] "a" "b"
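For the function-name extraction described in EDIT 2, a base R sketch without gsubfn could run each pattern separately and concatenate the results, so no alternation is needed. The patterns below are simplified versions of fnPat1 and fnPat2 (they require whitespace after "function" in a declaration and an "=" in an expression), and first_group is a made-up helper name.

```r
lines <- c("var activateBrush = function() {",
           " function brushed() { // Handles the response to brushing",
           " var followMouse = function(mX, mY) { // This draws the guides, nothing else",
           ".x(function(d) { return xContour(d.x); })",
           ".x(function(i) { return xContour(d.x); })")

# Hypothetical helper: first capture group of the first match in each line,
# dropping lines that do not match at all.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  unlist(lapply(m, function(g) if (length(g) >= 2) g[2] else character(0)))
}

decl_pat <- "function[[:space:]]+([[:alnum:]]+)[[:space:]]*\\("             # function name(...)
expr_pat <- "([[:alnum:]]+)[[:space:]]*=[[:space:]]*function[[:space:]]*\\("  # name = function(...)

fnNames <- c(first_group(decl_pat, lines), first_group(expr_pat, lines))
fnNames
#> [1] "brushed"       "activateBrush" "followMouse"
```

Anonymous callbacks such as .x(function(d) ...) match neither pattern, so the [^d|i] trick from the original patterns is not needed in this sketch.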

How can you increment a gsub() replacement string?

Assume a data frame has many columns that all say “bonus”. The goal is to rename each bonus column uniquely with an appended number. Example data:
string <- c("bonus", "bonus", "bonus", "bonus")
string
[1] "bonus" "bonus" "bonus" "bonus"
Desired column name output:
[1] "bonus1" "bonus2" "bonus3" "bonus4"
Assume you don’t know how many bonus columns there are, so you cannot simply paste the numbers from 1 to that count onto each bonus column name.
The following approach works but seems inelegant and seems too hard-coded:
bonus.count <- nrow(count(grep(pattern = "bonus", x = string)))
string.numbered <- paste0(string, seq(from = 1, to = bonus.count, 1))
How can the gsub function (or another regex-based function) substitute an incremented number? Along the lines of
string.gsub.numbered <- gsub(pattern = "bonus", replacement = "bonusincremented by one until no more bonuses", x = string)
As far as I know, gsub can't run any sort of function over each result, but using regexpr and regmatches makes this pretty easy
string <- c("bonus", "bonus", "bonus", "bonus")
m <- regexpr("bonus",string)
regmatches(string,m) <- paste0(regmatches(string,m), 1:length(m))
string
# [1] "bonus1" "bonus2" "bonus3" "bonus4"
The nice part is that regmatches allows for assignment so it's easy to swap out the matched values.
1) Using string defined in the question we can write:
paste0(string, seq_along(string))
2) If what you really have is something like this:
string2 <- "As a bonus we got a bonus coupon."
and you want to change that to "As a bonus1 we got a bonus2 coupon." then gsubfn in the gsubfn package can do that. Below, the fun method of the p proto object will be applied to each occurrence of "bonus" with count automatically incremented. The proto object p automatically saves the state of count between matches to allow this:
library(gsubfn)
string2 <- "As a bonus we got a bonus coupon." # test data
p <- proto(fun = function(this, x) paste0(x, count))
gsubfn("bonus", p, string2)
giving:
[1] "As a bonus1 we got a bonus2 coupon."
There are additional examples in the proto vignette.
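If the real column names mix "bonus" with other names, numbering only the matching entries avoids relying on every element being a match. A minimal sketch, where the cols vector is invented test data:

```r
cols <- c("id", "bonus", "salary", "bonus", "bonus")  # made-up column names

# Number only the entries that are exactly "bonus", in order of appearance.
is_bonus <- grepl("^bonus$", cols)
cols[is_bonus] <- paste0(cols[is_bonus], seq_len(sum(is_bonus)))
cols
#> [1] "id"     "bonus1" "salary" "bonus2" "bonus3"
```

The count of bonus columns is never needed up front, since seq_len(sum(is_bonus)) derives it from the data.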

regex match within parenthesis

I'm attempting to make some regular expressions that I wrote for Python also work in R.
Here is what I have in Python (using the excellent re module), with my expected 3 matches:
import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']
Now with R, here is my best attempt:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\"" "\"Second [L]\"" "\"Third [1/T]\""
Why did R match the whole pattern, rather than just within the parentheses? I was expecting:
[1] "First [T]" "Second [L]" "Third [1/T]"
Furthermore, perl=TRUE didn't make any difference. Is it safe to assume that R's regex does not support matching only the part in parentheses, or is there some trick that I'm missing?
Summary of solution: thanks @flodel, it appears to work well with other patterns too, so it looks like a good general solution. A compact form of the solution using an input string line and regular expression pattern pat is:
pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])
Furthermore, perl=TRUE should be added to gregexpr if using PCRE features in pat.
If you print m, you'll see that gregexpr(..., perl = TRUE) gives you the positions and lengths of matches for a) your full pattern, including the opening and closing quotes, and b) the captured (.*?).
Unfortunately for you, when m is used by regmatches, it uses the positions and lengths of the former.
There are two solutions I can think of.
Pass your final output through sub:
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)
Or use substring using the positions and lengths of the captured expressions:
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)
To further your understanding, see what happens if your pattern is trying to capture more than one thing. Also see that you can give names to your capture groups (what the doc refers to as Python-style named captures), here "capture1" and "capture2":
m <- gregexpr('"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"', line, perl = TRUE)
m
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos[, "capture1"],
end.pos[, "capture1"])
# [1] "First" "Second" "Third"
substring(line, start.pos[, "capture2"],
end.pos[, "capture2"])
# [1] "T" "L" "1/T"
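The substring approach can be wrapped into a small reusable function. This is only a sketch: extract_first_capture is a made-up name, it handles a single input string, and it returns only the first capture group of each match.

```r
# Hypothetical helper: return the first capture group of every match in `text`.
extract_first_capture <- function(pattern, text) {
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))          # no match at all
  start <- attr(m, "capture.start")[, 1]        # start of group 1 for each match
  len   <- attr(m, "capture.length")[, 1]       # length of group 1 for each match
  substring(text, start, start + len - 1L)
}

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
extract_first_capture('"(.*?)"', line)
#> [1] "First [T]"   "Second [L]"  "Third [1/T]"
```

perl = TRUE is required here, because the capture.start and capture.length attributes used above are the ones described for gregexpr(..., perl = TRUE).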
1) strapplyc in the gsubfn package acts in the way you were expecting:
> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"
2) Although it involves delving into m's attributes, it's possible to make regmatches work by reconstructing m to refer to the captures rather than the whole match:
at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )
regmatches( line, m2 )[[1]]
3) If we knew that the strings always ended in ] and were willing to modify the regular expression then this would work:
> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]" "Second [L]" "Third [1/T]"

Right gsub regexp entry

Given a string like this:
"structure(list(a = 5, f = 6), .Names = c(\"a\", \"f\"))"
Where the part
"structure(list( ), .Names = c( ))"
always stays the same. The entries of the form x = y inside the parentheses, and their counterparts inside c(), vary in content (y), in label (x), and in number.
What is the right global substitution, like in sed or R gsub, to get result
"a = 5, f = 6"
using only one gsub call?
I.e., everything before and after should go away.
The intention is to get the content of R's ellipsis "as it is", as a single piece of text, and insert it somewhere in a report. So the source comes from "...".
One of the solutions:
gsub("structure\\(list\\((.*)\\), .*$", "\\1", x)
# [1] "a = 5, f = 6"
or equivalently:
gsub(".*list\\((.*)\\), .*$", "\\1", x)
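To try the substitution end to end, the sketch below defines x as the literal string from the question. Since the stated goal is to turn the contents of "..." into text, it also shows a possible way to build that text directly from the arguments without a regex; dots_as_text is a made-up name, and it assumes every argument is named.

```r
x <- "structure(list(a = 5, f = 6), .Names = c(\"a\", \"f\"))"
res <- gsub("structure\\(list\\((.*)\\), .*$", "\\1", x)
res
#> [1] "a = 5, f = 6"

# Hypothetical alternative: deparse each named argument of ... and join them.
dots_as_text <- function(...) {
  args <- list(...)
  vals <- vapply(args, function(v) paste(deparse(v), collapse = ""), character(1))
  paste(names(args), vals, sep = " = ", collapse = ", ")
}

dots_as_text(a = 5, f = 6)
#> [1] "a = 5, f = 6"
```

The regex relies on "), " occurring only once, right after the list(...) part, which holds for this input but is worth checking if values themselves can contain that sequence.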