Can we pass a regex along with a replacement vector in R?

I have a string in which I need to replace text, and I would like to use the contents of a character vector inside the regex pattern. Is this possible?
txt='foo bar'
nchar(txt)
ix='foo'
gsub(ix,'bar', txt) #### output
gsub(pattern = '[^ix]', replacement = 'bar', txt)
The desired output is 'bar bar'.
Here ix is the character vector. My real question is how to use it inside a regex pattern.

We can use paste0 to join the string object with another string and build the pattern that way.
sub(paste0('^',ix), 'bar', txt)
#[1] "bar bar"
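NOTE: ^ inside [ ] has a different meaning: '[^ix]' is a negated character class that matches any single character other than the literal letters i and x. It does not refer to the variable ix.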
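A small sketch of the difference, reusing txt and ix from the question. The variable names pattern, anchored and negated are only illustrative.

```r
txt <- 'foo bar'
ix  <- 'foo'

# Build the pattern from the variable: paste0('^', ix) gives '^foo'
pattern  <- paste0('^', ix)
anchored <- sub(pattern, 'bar', txt)
print(anchored)

# '[^ix]' ignores the variable ix entirely: it replaces every single
# character that is not a literal "i" or "x" with 'bar'
negated <- gsub('[^ix]', 'bar', txt)
print(negated)
```

Since none of the seven characters in 'foo bar' is an i or an x, the second call replaces each of them, which is why it does not give the desired output.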

Related

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to split the pattern in my gsub() call into alternative subpatterns, but the combined pattern does not match anything.
Task: I want to delete every part of a string that contains either .ST or -XST in my vector of strings.
As you can see below, each expression works fine on its own, but the | alternation does not. I'm following the metacharacter guide at https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What is causing this?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
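It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. Without that flag, the spaces around | are matched literally. In R you can enable it with the inline (?x) modifier together with the perl=T option: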
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove everything from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* will match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
The following finds .ST or -XST at the end of the text and replaces it with an empty string (effectively removing that part). Remember that gsub returns the modified vector; it does not modify its argument in place. You won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
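To compare the three behaviours discussed in these answers side by side, here is a minimal sketch on a subset of the data from the question. The variable names are illustrative only.

```r
d <- c("AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST")

# Spaces are literal in the default engine, so neither alternative can match
with_spaces <- gsub('[.](.*)$ | [-](.*)$', '', d)
print(with_spaces)

# Same alternation without the spaces: cuts from the first . or - to the end
no_spaces <- gsub('[.](.*)$|[-](.*)$', '', d)
print(no_spaces)

# Only strips a literal .ST or -XST suffix, keeping anything before it
suffix_only <- gsub('(\\.ST|-XST)$', '', d)
print(suffix_only)
```

The second and third variants differ on strings such as "ALIV-SDB.ST", so which one is right depends on whether the part after the hyphen should be kept.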

Regexp to match text with optional text in parenthesis

Given the following vector of strings x
x <- c("hello", "foo_bar", "blah_blub_(bleep)", "blah_(xyz)", "xyz(_$_)")
I am looking for a regexp to extract everything before the optional parenthesis (and its content). So the final result for the above vector should be:
c("hello", "foo_bar", "blah_blub", "blah", "xyz")
I came up with the following regexp which, however, does not work (why?):
R> sub("^(.*)[_?\\(.*\\)]?$", "\\1", x)
[1] "hello" "foo_bar" "blah_blub_(bleep)" "blah_(xyz)" "xyz(_$_)"
Any help is appreciated!
We can match zero or more _ followed by ( followed by zero or more characters up to the end of the string, and replace the match with ''.
sub('_*\\(.*$', '', x)
#[1] "hello" "foo_bar" "blah_blub" "blah" "xyz"
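As for why the original attempt does nothing: square brackets form a character class, so [_?\\(.*\\)]? matches at most one single character, and the greedy (.*) in front of it already consumes the whole string. The sketch below checks the sub() answer against the desired result and adds a capture-group variant closer to the original idea. The variant is an assumption on my part, not taken from the answer; it needs perl = TRUE for the lazy quantifier and the non-capturing group.

```r
x <- c("hello", "foo_bar", "blah_blub_(bleep)", "blah_(xyz)", "xyz(_$_)")
expected <- c("hello", "foo_bar", "blah_blub", "blah", "xyz")

# Delete from the optional underscores plus opening parenthesis to the end
stripped <- sub("_*\\(.*$", "", x)
print(stripped)

# Capture-group variant: the lazy (.*?) keeps the prefix, and the
# optional non-capturing group (?: ... )? swallows "_(...)" when present
grouped <- sub("^(.*?)(?:_*\\(.*\\))?$", "\\1", x, perl = TRUE)
print(grouped)

stopifnot(identical(stripped, expected), identical(grouped, expected))
```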

Challenging regular expression

There is a string in the following format:
It can start with any number of strings enclosed by double braces, possibly with white space between them (whitespace may or may not occur).
It may also contain strings enclosed by double-braces in the middle.
I am looking for a regular expression that can separate the start from the rest.
For example, given the following string:
{{a}}{{b}} {{c}} def{{g}}hij
The two parts are:
{{a}}{{b}} {{c}}
def{{g}}hij
I tried this:
/^({{.*}})(.*)$/
But, it captured also the g in the middle:
{{a}}{{b}} {{c}} def{{g}}
hij
I tried this:
/^({{.*?}})(.*)$/
But, it captured only the first a:
{{a}}
{{b}} {{c}} def{{g}}hij
This repeatedly matches {{, then one or more characters that are neither { nor }, then }}, then optional whitespace, and stores the whole run in the first group. The rest of the string goes into the second group. If nothing at the start is surrounded by {{ and }}, the first group is empty. This is JavaScript:
var str = "{{a}}{{b}} {{c}} def{{g}}hij";
str.match(/^\s*((?:\{\{[^{}]+\}\}\s*)*)(.*)/)
// [whole match, group 1, group 2]
// ["{{a}}{{b}} {{c}} def{{g}}hij", "{{a}}{{b}} {{c}} ", "def{{g}}hij"]
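For readers following the R threads in this collection, roughly the same split can be sketched in R with regexec() and regmatches(). This is a port of the JavaScript pattern above rather than part of the original answer, and the names head_part and rest_part are illustrative.

```r
str <- "{{a}}{{b}} {{c}} def{{g}}hij"

# Same pattern as the JavaScript version, with backslashes doubled for R
pattern <- "^\\s*((?:\\{\\{[^{}]+\\}\\}\\s*)*)(.*)"

# regmatches() on a regexec() result returns the whole match followed by
# the capture groups, one character vector per input string
parts <- regmatches(str, regexec(pattern, str, perl = TRUE))[[1]]

head_part <- parts[2]   # leading {{...}} blocks, including trailing space
rest_part <- parts[3]   # everything after them

print(trimws(head_part))
print(rest_part)
```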
How about using preg_split:
$str = '{{a}}{{b}} {{c}} def{{g}}hij';
$list = preg_split('/(\s[^{].+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($list);
output:
Array
(
[0] => {{a}}{{b}} {{c}}
[1] => def{{g}}hij
)
I think I got it:
var string = "{{a}}{{b}} {{c}} def{{g}}hij";
console.log(string.match(/((\{\{\w+\}\})\s*)+/g));
// Output: [ '{{a}}{{b}} {{c}} ', '{{g}}' ]
Explanation:
( starts a group.
( another;
\{\{\w+\}\} looks for {{A-Za-z_0-9}}
) closes second group.
\s* matches whitespace if it's there.
)+ closes the first group and looks for one or more occurrences of it.
When it reaches anything that is not of the form {{something}}, it stops.
P.S. Complex regular expressions cost CPU time.
You can use this:
(java)
String[] result = yourstr.split("\\s+(?!\\{)");
(php)
$result = preg_split('/\s+(?!{)/', '{{a}}{{b}} {{c}} def{{g}}hij');
print_r($result);
I don't know exactly why you want to split, but if the string always contains a def and you want to cut the string into two halves at that point, you can try something like:
string text = "{{a}}{{b}} {{c}} def{{g}}hij";
Regex r = new Regex("def");
string[] split = new string[2];
int index = r.Match(text).Index;
split[0] = text.Substring(0, index);
split[1] = text.Substring(index);
// Output: [ '{{a}}{{b}} {{c}} ', 'def{{g}}hij' ]

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
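The loop in the question fails because, with fixed = TRUE, the pattern "#1" also matches the start of "#10" and "#11". If you want to keep the loop and take the replacements from vec, a minimal sketch is to anchor each pattern. The variable result is introduced here only so that original stays untouched.

```r
original <- c("#1", "#2", "#10", "#11")
vec <- c(1, 2)

result <- original
for (i in seq_along(vec)) {
  # "^#1$" matches the element "#1" only, never a prefix of "#10"
  pattern <- paste0("^#", i, "$")
  result <- sub(pattern, as.character(vec[i]), result)
}
print(result)
```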
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or, if you want to use the values of your vector vec explicitly:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses a zero-width negative lookahead assertion (what a mouthful!). This is the (?!...) construct, which matches # at the start of a string as long as it is not followed by whatever is in .... In this case that is two digits (or, equivalently, more, as long as they are contiguous). The matched # is replaced with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"
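The two approaches agree on the example data but are not equivalent in general. The sketch below uses a made-up test vector (the values "#3" and "#x" are hypothetical, not from the question) to show where they differ.

```r
test <- c("#1", "#3", "#10", "#x")

# Anchored pattern: only "#1" and "#2" as complete strings are changed
anchored <- gsub("^#([1-2])$", "\\1", test)
print(anchored)

# Lookahead: strips the leading # unless two digits follow it,
# so "#3" and "#x" lose their # as well
lookahead <- gsub("^#(?![0-9]{2})", "", test, perl = TRUE)
print(lookahead)
```

Which one fits depends on whether values other than #1 and #2 can occur in the real data.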

Regular expression to find and replace conditionally

I need to replace string A with string B, but only when string A is a whole word (e.g. "MECH"); I don't want to make the replacement when A is part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() call which checks whether string A appears as a whole word, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea of making the gsub() replacement when grepl() returns TRUE, and otherwise leaving the string alone. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern)){
reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
ifelse(grepl(reg, aa$type),
x <- gsub(pattern[i], replacement[i], x, ...),
aa$type)
}
x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for(i in 1:length(from)){
x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
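If the from vector should stay free of regex syntax, the boundaries can be added when the pattern is built. The helper below is a hypothetical sketch (replace_whole_words is not an existing function). It uses \\b with perl = TRUE instead of \\< and \\>, and assumes the entries of from contain no regex metacharacters.

```r
# Hypothetical helper: wrap each entry of `from` in word boundaries
# and apply the replacements one after another
replace_whole_words <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  for (i in seq_along(from)) {
    pattern <- paste0("\\b", from[i], "\\b")
    x <- gsub(pattern, to[i], x, perl = TRUE)
  }
  x
}

x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
       "MECH CONSTR", "MECHCONSTRUCTION")
result <- replace_whole_words(x,
                              from = c("MECH", "CONSTR"),
                              to   = c("MECHANICAL", "CONSTRUCTION"))
print(result)
```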
I use the regex (?<=\W|^)MECH(?=\W|$) to test whether a string contains MECH as a whole word.
Is that what you need?
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.
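One thing to keep in mind with the (\s|$) approach is that it only checks the end of the word. The sketch below compares it with a boundary on both sides; the input "XMECH" is a made-up value for illustration, and \\b with perl = TRUE stands in for the \< \> enclosure.

```r
x <- c("MECH CONSTR", "MECHANICAL", "XMECH")

# Checks only what follows MECH, so the tail of "XMECH" is replaced too
trailing_only <- gsub("MECH(\\s|$)", "MECHANICAL\\1", x)
print(trailing_only)

# Word boundary on both sides leaves "XMECH" alone
both_sides <- gsub("\\bMECH\\b", "MECHANICAL", x, perl = TRUE)
print(both_sides)
```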