String Editing in R - Trouble with Parentheses - regex

I'm editing some strings in R, and I would like to delete everything that is in parentheses from a string. The problem is that I'm not very savvy with regular expressions, and any time I use gsub on parentheses, it either fails or doesn't yield the correct result.
Any hints? I have a feeling it's a solvable problem. Might there be a function other than gsub that I can use?
Ex. Strings: "abc def (foo) abc (def)" should be stripped to "abc def abc"
If the only way to do this is to specify what's in the parentheses, that would be fine as well.

Just another way:
x <- "abc def (foo) abc (def)"
gsub(" *\\(.*?)", "", x)
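You need to escape the ( with a \ in regular expressions, and in an R string the backslash itself must be escaped, hence \\(. The pattern then matches anything (.*) after the ( in a non-greedy manner (the ? after .*), followed by ). With R's default regex engine the closing ) does not have to be escaped; with perl = TRUE an unmatched ) is an error, so writing \\) is the safer habit.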

Parentheses are usually special characters in regular expressions, and also in those used by R. You have to escape them with the backslash \. The trouble is that the backslash needs to be escaped in R strings as well, with a second backslash, which leads to the following rather clumsy construction:
gsub(" *\\([^)]*\\) *", " ", "abc def (foo) abc (def)")
Be careful with spaces: this call replaces each match with a single space, so the result keeps a trailing space ("abc def abc ").
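A minimal sketch of one way to deal with the leftover spaces, assuming base R only: remove the parenthesised parts first, then collapse runs of spaces and trim the ends. The variable names are illustrative and not part of the answers above.

```r
x <- "abc def (foo) abc (def)"

# Step 1: drop every "(...)" group; [^)]* stops at the first closing parenthesis.
stripped <- gsub("\\([^)]*\\)", "", x)

# Step 2: collapse runs of spaces, then trim leading/trailing whitespace.
cleaned <- trimws(gsub(" +", " ", stripped))

print(cleaned)
## [1] "abc def abc"
```

Splitting the work into two steps keeps each pattern simple, at the cost of also collapsing any double spaces that were already in the input.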

The bracketX function in the qdap package was designed for this problem:
library(qdap)
x <- "abc def (foo) abc (def)"
bracketX(x, "round")
## > bracketX(x, "round")
## [1] "abc def abc"

Related

Vim - removing leading and trailing spaces in a function

I am trying to remove leading and trailing spaces in a function but it does not work:
function Trim(s)
echo ">>>".a:s."<<<"
let ss = substitute(a:s, '\v^\s*([^\s]+)\s*$', '\1', '')
echo ">>>".ss."<<<"
endfunction
The regex \s*([^\s]+)\s* works ok on https://regex101.com/
Replacing * with + does not make any difference.
Testing:
: call Trim(" testing ")
Output:
>>> testing <<<
>>> testing <<<
Also it seems to matter if I use double quotes versus single quotes in substitute function.
Where are the problems and how can they be solved? Thanks.
Your issue is caused by your collection.
You should use [^ ] instead of [^\s]:
function! Trim(s)
echo ">>>".a:s."<<<"
let ss = substitute(a:s, '\v^\s*([^ ]+)\s*$', '\1', '')
echo ">>>".ss."<<<"
endfunction
This is because collections work on individual characters and \s is not an individual character; it's seen as \ followed by s, which doesn't resolve to anything because s is not a special character that needs escaping.
If you want your collection to include both spaces and tabs, use this:
[^ \t]
[ \t]
where \t represents a tab.
As romainl explained, [^\s] means neither \ nor s. The contrary of \s (i.e. anything but a space or a tab) would be \S.
Otherwise, here is another solution: in lh-vim-lib I've defined the following
function! lh#string#trim(string) abort
return matchstr(a:string, '^\v\_s*\zs.{-}\ze\_s*$')
endfunction
Regarding the difference(s) between the various kinds of quote characters, see this Q/A on vi.SE: https://vi.stackexchange.com/questions/9706/what-is-the-difference-between-single-and-double-quoted-strings
You are including what needs to be retained in your search/replace. It is much easier to look only for what needs to be removed and substitute that:
:%s/\v(^\s+|\s+$)//g
Breakdown
%s run the substitution on every line of the buffer
\v Use Very Magic
(^\s+ match leading whitespace
| or
\s+$) match trailing whitespace
//g replace every match on each line with nothing

What is the type of argument "replacement" in gensub() of GAWK?

The prototype of the function gensub() in GAWK is
gensub(regexp, replacement, how [, target])
According to my observations from examples, regexp is a regular expression enclosed in slashes.
In the examples I saw, a quoted string is provided as replacement (see the example below). But it can contain back-references to groups in the matched substring (see the example below), which suggests to me that the type of replacement is a regular expression, and that the quoted string provided as replacement is coerced into a regular expression.
Now I am confused: what is the type of replacement, a string or a regular expression?
Can I give a regular expression enclosed in slashes as replacement?
E.g., from the same link:
$ gawk '
> BEGIN {
> a = "abc def"
> b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
> print b
> }'
-| def abc
Can I replace b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a) with b = gensub(/(.+) (.+)/, /\2 \1/, "g", a)?
Btw, what does -| def abc mean?
Primarily, replacement is a string with a limited set of metacharacters.
If using a regex as the replacement compiles, then it may be accepted; I'd hate to have to work out what it does.
The -| def abc is mostly just the output from the preceding (illustrative) command. The role of the -| is explained in typographical conventions as a glyph marking output to standard output; most of the other example outputs have that marker before the output. It is not a part of the awk command, anyway. The awk command would generate def abc.
What characters are treated specially?
The manual says (at gensub()):
This is done by using parentheses in the regexp to mark the components and then specifying ‘\N’ in the replacement text, where N is a digit from 1 to 9.
It also mentions offering more than sub() and gsub() provide, so looking at gsub(), it says:
As in sub(), the characters ‘&’ and ‘\’ are special
and sub() says:
If the special character ‘&’ appears in replacement, it stands for the precise substring that was matched by regexp. … The effect of this special character (‘&’) can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write ‘\\&’ in a string constant to include a literal ‘&’ in the replacement.

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
library(stringr)  # provides str_extract_all
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches that immediately follow a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is the same as for the regex above, but restricted to the one of the six matches that is preceded by the specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
There may be a more intuitive way, but I think this will do the job:
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)
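Use (*SKIP)(*F). These are PCRE verbs; the perl() wrapper below comes from older stringr releases, so other setups need a PCRE-backed call such as base R's perl = TRUE: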
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
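As a hedged illustration of that syntax in base R, the sketch below uses regmatches() and gregexpr() with perl = TRUE instead of stringr. The input is a shortened stand-in for the question's string with the "-1-" separators already removed, and the variable names are made up for the example.

```r
# Shortened, illustrative input (not the full string from the question).
s <- "FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA"

core <- "(?:I\\d-?)*I3(?:-?I\\d)*"
skip <- paste0("FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|", core)

# Without the skip branch, every run containing I3 is returned.
all_hits <- regmatches(s, gregexpr(core, s, perl = TRUE))[[1]]
# all_hits is c("I2-I3", "I3-I1")

# With it, everything from the unwanted prefix onward is discarded.
kept <- regmatches(s, gregexpr(skip, s, perl = TRUE))[[1]]
# kept is "I2-I3"

print(all_hits)
print(kept)
```

Note that this discards matches after the unwanted prefix; it does not check that a match is directly preceded by the wanted one.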
Here is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
CODE
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it with a positive lookbehind (stringr's ICU engine accepts the optional -? inside the lookbehind):
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"
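For readers without stringr, a rough base R equivalent might look like the sketch below. It assumes perl = TRUE and makes the separator inside the lookbehind a literal "-" rather than "-?", because PCRE has traditionally required lookbehind to have a fixed length. The shortened input and the variable names are illustrative.

```r
# Shortened, illustrative input (not the full string from the question).
s <- "FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA"

# Fixed-length prefix, including the trailing separator.
prefix <- "FA-I2-I2-I2-EX-"
pattern <- paste0("(?<=", prefix, ")(?:I\\d-?)*I3(?:-?I\\d)*")

hit <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
print(hit)
## [1] "I2-I3"
```

The prefix here contains no regex metacharacters; a prefix that does would need escaping before being pasted into the pattern.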

How to replace a symbol by a backslash in R?

Could you help me to replace a char by a backslash in R? My trial:
gsub("D","\\","1D2")
Thanks in advance
You need to re-escape the backslash because it needs to be escaped once as part of a normal R string (hence '\\' instead of '\'), and in addition it’s handled differently by gsub in a replacement pattern, so it needs to be escaped again. The following works:
gsub('D', '\\\\', '1D2')
# "1\\2"
The reason the result looks different from the desired output is that R doesn’t actually print the result, it prints an interpretable R string (note the surrounding quotation marks!). But if you use cat or message it’s printed correctly:
cat(gsub('D', '\\\\', '1D2'), '\n')
# 1\2
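A small sketch to confirm that the result really contains a single backslash, using only base R; the variable name res is illustrative.

```r
res <- gsub("D", "\\\\", "1D2")

print(res)       # escaped display form: [1] "1\\2"
nchar(res)       # 3 characters: "1", one backslash, "2"
writeLines(res)  # raw characters: 1\2
```

nchar() counts the characters actually stored, so it is a quick check whenever the printed form looks suspicious.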
When inputting backslashes from the keyboard, always escape them:
gsub("D","\\\\","1D2")
#[1] "1\\2"
or,
gsub("D","\\","1D2", fixed=TRUE)
#[1] "1\\2"
or,
library(stringr)
str_replace("1D2","D","\\\\")
#[1] "1\\2"
Note: print() never shows "1\2"; it always displays the backslash in its escaped form, "1\\2", even though the string itself holds a single backslash. For path names, you can use forward slashes to avoid the issue.
For more information, refer to this issue raised in R help: How to replace double backslash with single backslash in R.
gsub("\\p{S}", "\\\\", text, perl=TRUE);
\p{S} ... Match a character from the Unicode category symbol.

Convert character to lowerCamelCase in R

I have character vector which looks like this:
x <- c("cult", "brother sister relationship", "word title")
And I want to convert it to the lowerCamelCase style looking like this:
c("cult", "brotherSisterRelationship", "wordTitle")
I played around with gsub, gregexpr, strsplit, regmatches and many other functions, but couldn't get a grip on it.
Strings that contain two spaces seem especially difficult to handle.
Maybe someone here has an idea how to do this.
> x <- c("cult", "brother sister relationship", "word title")
> gsub(" ([^ ])", "\\U\\1", x, perl=TRUE)
[1] "cult" "brotherSisterRelationship"
[3] "wordTitle"
Quoting from pattern matching and replacement:
For perl = TRUE only, it can also contain "\U" or "\L" to convert the
rest of the replacement to upper or lower case and "\E" to end case
conversion.
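As a hedged variation on the call above, the sketch below matches any run of whitespace (\\s+) instead of a single space, so repeated spaces or tabs between words are also removed. The input vector is made up for illustration.

```r
# Illustrative input: a double space and a tab between words.
x <- c("cult", "brother  sister relationship", "word\ttitle")

# \\s+ swallows the whole whitespace run; \\U upper-cases the captured character.
camel <- gsub("\\s+(\\S)", "\\U\\1", x, perl = TRUE)
# camel is c("cult", "brotherSisterRelationship", "wordTitle")

print(camel)
```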
A non-base alternative:
library(R.utils)
toCamelCase(x, capitalize = FALSE)
# [1] "cult" "brotherSisterRelationship" "wordTitle"