Replace text between two special characters - regex

I have a character vector as:
x<- "\t\t<taxon id=\"TOT_F50\"/>"
and
y<- "TOT_A01"
and I want to replace TOT_F50 with the text in y ("TOT_A01").
Do you know how to replace the text between the two quotation marks (i.e. TOT_F50)?

Try
sub('(?<=").*(?=")', y, x, perl=TRUE)
#[1] "\t\t<taxon id=\"TOT_A01\"/>"

I would use something like
gsub("\".*\"", paste0("\"", y, "\""), x)
It means "find everything from the first quotation mark to the last one in x, quotation marks included, and replace it with y wrapped in quotation marks".
I think this is what you want, although the example in your question is wrong.
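A minimal sketch comparing the two answers above on the question's input. The second input, x2, is a hypothetical string (not from the question) with two quoted attributes; it is there only to show that a greedy .* runs to the last quotation mark, while [^"]* stops at the next one.

```r
x <- "\t\t<taxon id=\"TOT_F50\"/>"
y <- "TOT_A01"

# Lookaround version: only the text between the quotes is matched.
a <- sub('(?<=").*(?=")', y, x, perl = TRUE)

# Consuming version: the quotes are matched too, so they are pasted back in.
b <- gsub("\".*\"", paste0("\"", y, "\""), x)

identical(a, b)  # both give "\t\t<taxon id=\"TOT_A01\"/>"

# Hypothetical input with two attributes (assumption, not in the question).
x2 <- "<taxon id=\"TOT_F50\" rank=\"species\"/>"
greedy <- sub('(?<=").*(?=")', y, x2, perl = TRUE)     # swallows both attributes
narrow <- sub('(?<=")[^"]*(?=")', y, x2, perl = TRUE)  # replaces only the first value
```

Note that y is used as a replacement string, so a backslash inside y would be interpreted as an escape; that is not an issue for IDs like the ones shown here.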

Related

Regex of consecutive punctuation in R

I have a character vector that looks like this:
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
[9992] "./."
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9996] ",/,"
[9997] "wretched/JJ"
I want to remove all entries that consist of three consecutive punctuation marks, resulting in something like this:
[9993] "To/TO"
[9994] "my/PRP$"
[9995] "starved/VBN"
[9997] "wretched/JJ"
I've tried different regex expressions:
sub("[:punct:]/[:punct:]", "", z)
and
sub("[:punct:]{3}", "", z)
with either single/double brackets, both yield:
[9992] "./."
[9993] "To"
[9994] "my$"
[9995] "starved"
[9996] ",/,"
[9997] "wretched"
Any ideas? And I apologize in advance if the question is dumb; I'm not very good at this!
Try this:
x <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ") # test input
grep("[[:punct:]]{3}", x, value = TRUE, invert = TRUE)
## [1] "To/TO" "my/PRP$" "starved/VBN" "wretched/JJ"
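Two details may be worth spelling out. First, POSIX classes such as [:punct:] only have their special meaning inside a bracket expression, hence [[:punct:]]; as far as I know, a bare [:punct:] is read by the default engine as an ordinary set of the characters :, p, u, n, c and t. Second, the pattern above removes any entry that contains three consecutive punctuation characters anywhere. If only entries made up of exactly three punctuation characters should go, anchoring the pattern is one option. A small sketch; the extra entry "wait.../VB" is a made-up example, not from the question:

```r
z <- c("./.", "To/TO", "my/PRP$", "starved/VBN", ",/,", "wretched/JJ")

# Anchored: drop only entries that are three punctuation characters and nothing else.
keep <- !grepl("^[[:punct:]]{3}$", z)
kept <- z[keep]

# Hypothetical entry containing "..." inside a longer token.
z2 <- c(z, "wait.../VB")
unanchored <- grep("[[:punct:]]{3}", z2, value = TRUE, invert = TRUE)   # drops "wait.../VB" too
anchored   <- grep("^[[:punct:]]{3}$", z2, value = TRUE, invert = TRUE) # keeps it
```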

Replace a random block of characters in a string in R

I have a text and I want to replace a text block in a line, like that:
"\t\t\tFGHGFJKJKJKGDSJS"
with
x= "ABCCCBBHHJJJH"
I'm interested in changing just the text block (FGHGFJKJKJKGDSJS) without modifying the other special characters, so as to obtain:
"\t\t\tABCCCBBHHJJJH"
Is there a way to replace FGHGFJKJKJKGDSJS without specifying the exact combination of letters?
I found a solution along these lines: txt[n] <- paste0("\t\t\t", x), where n is the line number.
But I would like to know whether there is a more general solution.
> library(stringr)
> mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
> x <- "ABCCCBBHHJJJH"
> str_replace(mystring,"\\w+",x)
[1] "\t\t\tABCCCBBHHJJJH"
\w+ matches one or more word characters (letters, digits, or underscores), as many as possible. str_replace replaces the first such run with the value of x, so the leading tabs are left untouched.
> a = "\t\t\tDFGGD"
> gsub("(\t\t\t).*","\\1ABCDF",a)
[1] "\t\t\tABCDF"
mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
x <- "ABCCCBBHHJJJH"
sub('\\w+', x, mystring, ignore.case = TRUE)
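One caveat with \\w+: it stops at the first character that is not a letter, digit, or underscore. If the block may contain other characters, matching the first run of non-whitespace with \\S+ is a possible alternative. A minimal sketch; the second input, with a hyphen in the block, is a made-up example:

```r
mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
x <- "ABCCCBBHHJJJH"

# Same result as \\w+ on the question's input.
r1 <- sub("\\S+", x, mystring)

# Hypothetical block containing a hyphen (assumption, not in the question).
other <- "\t\t\tFGH-GFJ"
r_word     <- sub("\\w+", x, other)  # replaces only "FGH"
r_nonspace <- sub("\\S+", x, other)  # replaces the whole block "FGH-GFJ"
```

As with any use of sub, x is treated as a replacement string, so backslashes inside x would need escaping.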

R: Substring after finding a character position?

I have seen a few questions about returning the position of a character within a string in R, but I cannot seem to figure it out for my case. I think this is because I'm trying to do it for a whole column rather than a single string, but it could just be my struggles with regex.
Right now, I have a data.frame with a column, df$id that looks something like 13.23-45-6A. The number of digits before the period is variable, but I would like to retain just the part of the string after the period for each row in the column. I would like to do something like:
df$new <- substring(df$id, 1 + indexOf(".", df$id))
So 12.23-45-6A would become 23-45-6A, 0.1B would become 1B, 4.A-A would become A-A and so on for an entire column.
Right now I have:
df$new <- substr(df$id, 1 + regexpr("\\.", df$id), 99)
Thanks for any advice.
As @AnandaMahto mentioned in his comment, you would probably be better off simplifying things and using gsub:
> x <- c("13.23-45-6A", "0.1B", "4.A-A")
> gsub("[0-9]*\\.(.*)", "\\1", x, perl = TRUE)
[1] "23-45-6A" "1B" "A-A"
To make this work with your existing data frame you can try:
df$id <- gsub("[0-9]*\\.(.*)", "\\1", df$id, perl = TRUE)
Another way is to use strsplit. Using @Tim's example:
x <- c("13.23-45-6A", "0.1B", "4.A-A")
sapply(strsplit(x, "\\."), "[", -1)
#[1] "23-45-6A" "1B" "A-A"
You could remove everything up to and including the first . using
sub('[^.]*\\.', '', x)
#[1] "23-45-6A" "1B" "A-A"
data
x <- c("13.23-45-6A", "0.1B", "4.A-A")
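For completeness, the position-based approach the question was reaching for (the indexOf idea) can also be written without a regex. A minimal sketch, assuming the IDs are in a character vector like x above; regexpr with fixed = TRUE looks for a literal period and is vectorised, so it works on a whole column at once:

```r
x <- c("13.23-45-6A", "0.1B", "4.A-A")

# Position of the first literal "." in each element (-1 where there is none).
pos <- as.integer(regexpr(".", x, fixed = TRUE))

# Keep everything after that position.
after <- substring(x, pos + 1)
after
```

Elements without a period get pos = -1, so substring starts at 0 and returns them unchanged. For a data frame column the same two steps apply to df$id.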

r regex Lookbehind Lookahead issue

I am trying to extract passages like 44.11.36.00-1 (precisely, nn.nn.nn.nn-n, where n stands for any digit from 0-9) from text in R.
I want to extract passages only if they are directly adjacent to non-digit characters:
44.11.36.00-1 extracted from nsfghstighsl44.11.36.00-1vsdfgh is OK
44.11.36.00-1 extracted from fa0044.11.36.00-1000 is NOT
I have read that str_extract_all does not work with lookbehind and lookahead expressions, so I reluctantly went back to grep, but I cannot get it to work:
> pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
> grep(pattern1, "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 ", perl=TRUE, value = TRUE)
[1] "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
which is not the result I expected.
I thought that:
(?<![0-9]{1}) means "match expression which is not preceded by a number"
[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1} stands for the expression I seek for
(?![0-9]{1}) means "match expression which is not followed by a number"
You don't actually need lookahead or lookbehind with this approach. Just parenthesize the portion you want extracted:
library(gsubfn)
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000") # test data
pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"
strapply(x, pat, ~ ..2, simplify = c)
## "44.11.36.00-1"
Note that ~ ..2 is short for the function function(...) ..2 which means grab the match to the second parenthesized portion in the regular expression. We could also have written function(x, y, z) y or x + y + z ~ y .
Note: The question seems to say that a non-numeric character must come directly before and after the string, but based on comments that have since disappeared, it appears that what was really wanted was that the string be either at the beginning or just after a non-number, and either at the end or followed by a non-number. The answer has been modified accordingly.
As @Roland said in his comment, you need to use regmatches instead of grep:
> s <- "nsfghstighsl44.11.36.00-1vsdfgh"
> m <- gregexpr("(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "44.11.36.00-1"
A reduced one,
> x <- c('nsfghstighsl44.11.36.00-1vsdfgh', 'fa0044.11.36.00-1000')
> m <- gregexpr("(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "44.11.36.00-1"

[[2]]
character(0)
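To connect this back to the question: the pattern there was fine, but grep(..., value = TRUE) returns every input element that contains a match, not the matched text itself, which is why the whole string came back. A short sketch on the question's own test string, using unlist to flatten the list that regmatches returns:

```r
s <- "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "
pat <- "(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)"

# grepl/grep only report whether (or which) elements contain a match.
has_match <- grepl(pat, s, perl = TRUE)

# gregexpr + regmatches return the matched substrings themselves.
found <- unlist(regmatches(s, gregexpr(pat, s, perl = TRUE)))
found
```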

Regular expression to find and replace conditionally

I need to replace string A with string B only when string A is a whole word (e.g. "MECH"); I don't want to make the replacement when A is part of a longer string (e.g. "MECHANICAL"). So far, I have a grepl() which checks whether string A is a whole word, but I cannot figure out how to make the replacement. I have added an ifelse() with the idea of making the gsub() replacement when grepl() returns TRUE and leaving the string alone otherwise. Any suggestions? Please see the code below. Thanks.
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH", "MECH CONSTR", "MECHCONSTRUCTION"))
from <- c("MECH", "MECHANICAL", "CONSTR", "CONSTRUCTION")
to <- c("MECHANICAL", "MECHANICAL", "CONSTRUCTION", "CONSTRUCTION")
gsub2 <- function(pattern, replacement, x, ...) {
  for(i in 1:length(pattern)){
    reg <- paste0("(^", pattern[i], "$)|(^", pattern[i], " )|( ", pattern[i], "$)|( ", pattern[i], " )")
    ifelse(grepl(reg, aa$type),
           x <- gsub(pattern[i], replacement[i], x, ...),
           aa$type)
  }
  x
}
aa$title3 <- gsub2(from, to, aa$type)
You can enclose the strings in the from vector in \\< and \\> to match only whole words:
x <- c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
"MECH CONSTR", "MECHCONSTRUCTION")
from <- c("\\<MECH\\>", "\\<CONSTR\\>")
to <- c("MECHANICAL", "CONSTRUCTION")
for (i in seq_along(from)) {
  x <- gsub(from[i], to[i], x)
}
print(x)
# [1] "CONSTRUCTION" "MECHANICAL CONSTRUCTION"
# [3] "MECHANICAL CONSTRUCTION MECHANICAL" "MECHANICAL CONSTRUCTION"
# [5] "MECHCONSTRUCTION"
I use the regex (?<=\W|^)MECH(?=\W|$) to check whether a string contains MECH as a whole word.
Is that what you need?
Just for posterity, other than using the \< \> enclosure, a whole word can be defined as any string ending in a space or end-of-line (\s|$).
gsub("MECH(\\s|$)", "MECHANICAL\\1", aa$type)
The only problem with this approach is that you need to carry over the space or end-of-line that you used as part of the match, hence the encapsulation in parentheses and the backreference (\1).
The \< \> enclosure is superior for this particular question, since you have no special exceptions. However, if you have exceptions, it is better to use a more explicit method. The more tools in your toolbox, the better.
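If the from vector should stay free of regex syntax, the word boundaries can be added when the pattern is built. A minimal sketch using \\b with perl = TRUE, which behaves like the \\< \\> enclosure for these inputs; the variable name out is just for illustration:

```r
aa <- data.frame(type = c("CONSTR", "MECH CONSTRUCTION", "MECHANICAL CONSTRUCTION MECH",
                          "MECH CONSTR", "MECHCONSTRUCTION"),
                 stringsAsFactors = FALSE)
from <- c("MECH", "CONSTR")
to   <- c("MECHANICAL", "CONSTRUCTION")

out <- aa$type
for (i in seq_along(from)) {
  # \\b matches at a word boundary, so MECH inside MECHANICAL is left alone.
  out <- gsub(paste0("\\b", from[i], "\\b"), to[i], out, perl = TRUE)
}
out
```

This assumes the entries in from contain no regex metacharacters; if they might, they would need escaping first.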