I want to replace all words that begin with a given character with a different word. I tried gsub and str_replace_all, with little success. In this example I want to replace all words starting with R with MM. gsub replaces only the first word:
gsub("^R*\\w+", "MM", "Red, Rome, Ralf")
# [1] "MM, Rome, Ralf"
Thanks in advance
You must either remove the string start anchor (^) or work with a vector of words:
gsub("\\bR\\w+", "MM", "Red, Rome, Ralf")
#[1] "MM, MM, MM"
gsub("^R\\w+", "MM", c("Red", "Rome", "Ralf"))
#[1] "MM" "MM" "MM"
Also, you probably want "R" instead of "R*", since the latter can match 0 or more instances of "R". The regexes above match only words with 2 or more characters, the first of which must be "R". The last regex only matches words at the beginning of the string.
Thanks @flodel for pointing out the missing word boundary "\b" in the first regex!
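To make the difference between "R*", "R\\w+" and the word boundary concrete, here is a small sketch. The input string and variable names are made up for illustration, and perl = TRUE is used so that \\b is handled by PCRE:

```r
# Illustrative input: contains a non-R word and a one-letter word "R"
words_str <- "Red, blue, Rome, R"

# "R*" allows zero R's, so every word matches
star <- gsub("\\bR*\\w+", "MM", words_str, perl = TRUE)
# "R\\w+" needs at least two characters, so the lone "R" is skipped
plus <- gsub("\\bR\\w+", "MM", words_str, perl = TRUE)
# "R\\w*" also accepts the one-letter word
any_len <- gsub("\\bR\\w*", "MM", words_str, perl = TRUE)

print(star)     # "MM, MM, MM, MM"
print(plus)     # "MM, blue, MM, R"
print(any_len)  # "MM, blue, MM, MM"
```

If one-letter words should be replaced too, "\\w*" instead of "\\w+" is probably what you want.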
I have a vector with the following elements:
myvec<- c("output.chr10.recalibrated", "output.chr11.recalibrated",
"output.chrY.recalibrated")
I want to selectively extract the value after chr and before .recalibrated and get the result.
Result:
10, 11, Y
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
The pattern means:
chr - match a sequence of literal characters chr
(.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated - .recalibrated literal character sequence.
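If stringr is not available, roughly the same capture-group extraction can be sketched in base R with regexec and regmatches. The extra "no_match_here" element is an assumption added to show what happens when the pattern does not match:

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated", "no_match_here")

# regexec returns the full match plus each capture group per element
m <- regmatches(myvec, regexec("chr(.*?)\\.recalibrated", myvec, perl = TRUE))

# Element 1 is the whole match, element 2 the first capture group;
# non-matching inputs give character(0), so indexing [2] yields NA
ids <- vapply(m, function(g) g[2], character(1))
print(ids)  # "10" "11" "Y" NA
```

Unlike sub, which returns a non-matching input unchanged, this gives NA for inputs that do not fit the pattern.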
Both answers fail for slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here's my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
what the regex does is:
.*[.]chr match as much as possible until finding '.chr' literally
([^.]*) capture everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, requiring at least one digit)
[.].* match the rest of the line after a literal dot
I prefer the character-class escape of dots ([.]) over the backslash escape (\\.) as it's usually easier to read when you come back to the regex; that's my opinion and not covered by any best practice I know of.
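A quick sketch of the failure case mentioned above, comparing the lazy pattern with the dot-delimited one. The variable names are illustrative, and perl = TRUE is added to the first call only to pin down the lazy-quantifier behaviour:

```r
x <- "whatever.chr10.whateverelse.recalibrated"

# Lazy capture runs up to the first ".recalibrated", dragging extra text along
lazy_result <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", x, perl = TRUE)

# Capturing "everything that is not a dot" stops at the next dot
dot_class_result <- sub(".*[.]chr([^.]*)[.].*", "\\1", x)

print(lazy_result)       # "10.whateverelse"
print(dot_class_result)  # "10"
```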
We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and precede .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match either everything up to and including .chr, or (|) everything from .recalibrated to the end ($) of the string, and replace both with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
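The same lookaround idea can be sketched in base R with regexpr and regmatches (perl = TRUE is required for lookbehind). The input with two .recalibrated suffixes is a made-up example showing that the greedy .* runs to the last one, while a lazy .*? stops at the first:

```r
x <- "out.chr10.recalibrated.bak.recalibrated"

# Greedy: extends to the last ".recalibrated"
greedy <- regmatches(x, regexpr("(?<=chr).*(?=\\.recalibrated)", x, perl = TRUE))
# Lazy: stops at the first ".recalibrated"
lazy <- regmatches(x, regexpr("(?<=chr).*?(?=\\.recalibrated)", x, perl = TRUE))

print(greedy)  # "10.recalibrated.bak"
print(lazy)    # "10"
```

Note that regmatches with regexpr silently drops elements that do not match, so the result can be shorter than the input vector.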
Looks like an XY problem. Why extract? If this is needed in further analysis steps, we could for example do this instead:
for(chrN in c(1:22, "X", "Y")) {
myVar <- paste0("output.chr", chrN, ".recalibrated")
#do some fun stuff with myVar
print(myVar)
}
I want to replace all the single characters in my string with a blank. My idea is that there should be a space before and after the single character, so I have put spaces before and after the character in the pattern, but that doesn't seem to work. I would also like to replace words with more than one character, e.g. all words of length 2: how would the code change?
str="I have a cat of white color"
str=gsub("([[:space:]][[a-z]][[:space:]])", "", str)
I want to replace all the single character in my string with a blank. My idea is that there should be a space before and after the single character.
The idea is not correct: a word is not always surrounded by spaces. What if the word is at the beginning of the string? Or at the end? Or is followed by punctuation?
Use \b word boundary:
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
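A small sketch of why these three positions matter, using a made-up input where one-letter words sit at the start, before punctuation, and at the end of the string:

```r
input <- "a cat, b. I"

# Requiring literal spaces on both sides finds nothing here
with_spaces <- gsub(" [a-z] ", "", input)

# Word boundaries match at string start, before punctuation, and at string end
with_boundaries <- gsub("(?i)\\b[a-z]\\b", "_", input, perl = TRUE)

print(with_spaces)      # "a cat, b. I"
print(with_boundaries)  # "_ cat, _. _"
```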
NOTE that in R, when you use gsub, it is best to use it with the PCRE regex (pass perl=T):
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
So, to match all 1-letter words, you need to use
gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
Note that (?i) is a case-insensitive modifier (making [a-z] match both a and A).
Now, you need to match 2 letter words:
gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
Here, we are using a limiting quantifier, {n} (exactly n times) or {min,max}, to specify how many times the pattern quantified with this construct can be repeated.
Demo:
> input = "I am a football fan"
> gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
[1] "REPLACEMENT am REPLACEMENT football fan"
> gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
[1] "I REPLACEMENT a football fan"
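The limiting quantifier also accepts ranges. A brief sketch with an illustrative sentence (lowercase-only classes, so the capital "I" is left alone):

```r
input <- "I am a big football fan"

# {2,3}: words of two or three lowercase letters
two_to_three <- gsub("\\b[a-z]{2,3}\\b", "_", input, perl = TRUE)
# {5,}: words of five or more lowercase letters
five_plus <- gsub("\\b[a-z]{5,}\\b", "_", input, perl = TRUE)

print(two_to_three)  # "I _ a _ football _"
print(five_plus)     # "I am a big _ fan"
```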
You need to use a quantifier, e.g. [a-z]{2}, which matches exactly two consecutive lowercase letters. The regex pattern you want is something along the lines of this:
\\s[a-z]{2}\\s
You can build this regex dynamically in R using an input number of characters. Here is a code snippet which demonstrates this:
str <- "I have a cat of white color"
nchars <- 2
exp <- paste0("\\s[a-z]{", nchars, "}\\s")
gsub(exp, "", str)
# [1] "I have a catwhite color"
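Because the \\s-based pattern also removes the surrounding spaces and misses words at the ends of the string, a variant built on word boundaries may be preferable. This is only a sketch; replace_n_letter_words is a hypothetical helper name, not an existing function:

```r
# Hypothetical helper: replace every ASCII-letter word of exactly n letters
replace_n_letter_words <- function(x, n, replacement = "") {
  pattern <- sprintf("(?i)\\b[a-z]{%d}\\b", as.integer(n))
  gsub(pattern, replacement, x, perl = TRUE)
}

s <- "I have a cat of white color"
two <- replace_n_letter_words(s, 2, "_")
one <- replace_n_letter_words(s, 1, "_")

print(two)  # "I have a cat _ white color"
print(one)  # "_ have _ cat of white color"
```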
I have character strings with two underscores. Like these
c54254_g4545_i5454
c434_g4_i455
c5454_g544_i3
.
.
etc
I need to split these strings at the second underscore, and I am afraid I have no clue how to do that in R (or any other tool, for that matter). I'd be very happy if anyone could sort me out here.
Thank you
SM
One way would be to replace the second underscore by another delimiter (i.e. space) using sub and then split using that.
Using sub, we match one or more characters that are not a _ from the beginning (^) of the string (^[^_]+), followed by the first underscore (_), followed by one or more characters that are not a _ ([^_]+). We capture that as a group by placing it inside parentheses ((....)). Then we match the second _ followed by the rest of the string (zero or more characters) in the second capture group ((.*)$). In the replacement, we separate the first (\\1) and second (\\2) groups with a space.
strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', v1), ' ')
#[[1]]
#[1] "c54254_g4545" "i5454"
#[[2]]
#[1] "c434_g4" "i455"
#[[3]]
#[1] "c5454_g544" "i3"
data
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
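As a variation on the same two capture groups, the pieces can be pulled out directly with regexec and regmatches, which avoids the temporary space delimiter. A sketch in base R; the names prefix and suffix are illustrative:

```r
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')

# Each list element holds: full match, group 1, group 2
parts <- regmatches(v1, regexec("^([^_]+_[^_]+)_(.*)$", v1))

prefix <- vapply(parts, `[`, character(1), 2)  # text before the second underscore
suffix <- vapply(parts, `[`, character(1), 3)  # text after the second underscore

print(prefix)  # "c54254_g4545" "c434_g4" "c5454_g544"
print(suffix)  # "i5454" "i455" "i3"
```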
x <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
strsplit(sub("(_)(?=[^_]+$)", " ", x, perl=TRUE), " ")
#[[1]]
#[1] "c54254_g4545" "i5454"
#
#[[2]]
#[1] "c434_g4" "i455"
#
#[[3]]
#[1] "c5454_g544" "i3"
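With the pattern "(_)(?=[^_]+$)", we replace the underscore that is followed only by non-underscore characters up to the end of the string, i.e. the last underscore, which is the second one in these strings. Because the lookahead consumes nothing, the replacement needs no backreferences (the parentheses around _ are optional).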
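The two approaches agree only when a string has exactly two underscores. A sketch with a made-up input containing three underscores shows where they diverge:

```r
y <- "a_b_c_d"

# Capture-group pattern: splits at the second underscore
at_second <- strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', y), ' ')[[1]]
# Lookahead pattern: splits at the last underscore
at_last <- strsplit(sub("(_)(?=[^_]+$)", " ", y, perl = TRUE), " ")[[1]]

print(at_second)  # "a_b" "c_d"
print(at_last)    # "a_b_c" "d"
```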
I did this. However, although it works there may be a 'better' way?
str = 'c110478_g1_i1'
m = strsplit(str, '_')
f <- paste(m[[1]][1],m[[1]][2],sep='_')
I have strings of the following flavor:
Random Inc
A Non-Random Inc
I would like to remove the word Inc from all those strings where there is more than 1 word preceding it. The result on the above two examples would be:
Random Inc
A Non-Random
What is the right regex to plug into gsub for this? In particular, how does one specify complete words in regex? I thought it would be \w but this is a word character which does not seem correct.
\w matches a word character, but in this case it seems you need to account for the hyphen and use a quantifier.
x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')
sub('[\\w-]+ [\\w-]+\\K *Inc', '', x, perl=TRUE)
# [1] "Random Inc" "A Non-Random" "Another Inc" "A Random other"
First we match one or more word characters or hyphens, followed by a space, followed by one or more word characters or hyphens again. The \K escape sequence resets the starting point of the reported match, so the characters consumed before it are no longer part of what gets replaced. Then we match zero or more spaces followed by the word Inc. Because everything before \K is kept out of the match, an empty replacement removes only Inc and the spaces in front of it.
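To see \K in isolation, here is a minimal sketch on a simpler, made-up pattern, next to the equivalent capture-group form:

```r
s <- "Non-Random Inc"

# Text before \K must be present but is not replaced
with_K <- sub("Random\\K Inc", "", s, perl = TRUE)
# Same effect with a capture group that is put back via \\1
with_group <- sub("(Random) Inc", "\\1", s, perl = TRUE)

print(with_K)      # "Non-Random"
print(with_group)  # "Non-Random"
```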
You can use a regex like this:
([-\w]+\s+[-\w]+)\s+Inc
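One way this regex might be plugged into sub in R: the backslashes are doubled inside the string literal, and the captured two words are restored with \\1. This is a sketch, not necessarily the answerer's exact call:

```r
x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')

# Group 1 keeps the two preceding words; only the spaces and Inc are dropped
res <- sub("([-\\w]+\\s+[-\\w]+)\\s+Inc", "\\1", x, perl = TRUE)
print(res)  # "Random Inc" "A Non-Random" "Another Inc" "A Random other"
```

Appending \\b after Inc would keep the pattern from matching inside longer words that merely start with Inc.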
I think you mean one or more non-space characters as complete word. If yes, then you could use \S+.
> x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')
> sub("^\\S+(?:\\s+\\S+)?$(*SKIP)(*F)|\\s+Inc\\b", "", x, perl=T)
[1] "Random Inc" "A Non-Random" "Another Inc" "A Random other"
^\\S+(?:\\s+\\S+)?$ Matches the line which has exactly one or two words.
(*SKIP)(*F) Causes the match to Fail.
| OR (ie, consider only the remaining part of the string)
\\s+Inc\\b Matches Inc and also the preceding one or more space characters.
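If the (*SKIP)(*F) verbs feel too opaque, a different technique (not the one above) is to count the words first and only then strip a trailing Inc. A sketch, assuming words are separated by whitespace and Inc is the last word:

```r
x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')

# Number of whitespace-separated words per string
n_words <- lengths(strsplit(x, "\\s+"))

# Strip the trailing " Inc" only when more than one word precedes it
result <- ifelse(n_words > 2, sub("\\s+Inc$", "", x), x)
print(result)  # "Random Inc" "A Non-Random" "Another Inc" "A Random other"
```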
I have some problems with different strings being concatenated and which I would like to split again.
I am dealing with things such as
name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
which in this case should be split in
"o-n-Butylhydroxylamine", "1-Methylpropylhydroxylamine" and "Amino-2-butanol"
Any thoughts how I could use strsplit and/or gsub regular expression to achieve this?
The rule I would like to use is to split a word wherever a number, a bracket ("(") or a capital letter follows a lowercase letter. Any thoughts how to do this?
You could use positive look-around assertions to find (and then split at) inter-character positions preceded by a lower case letter and succeeded by an upper case letter, a digit, or a (.
name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
strsplit(name, pat, perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine" "1-Methylpropylhydroxylamine"
# [3] "Amino-2-butanol"
strsplit(name, "(?<=([a-z]))(?=[A-Z]|[0-9]|\\()", perl=TRUE)
# [[1]]
# [1] "o-n-Butylhydroxylamine" "1-Methylpropylhydroxylamine" "Amino-2-butanol"
Remember that the return value is a list, so use [[1]] if appropriate.
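Since strsplit returns one list element per input string, applying the pattern to a vector might look like the sketch below. The second name, "ethylAmine", is made up for illustration:

```r
pat <- "(?<=[[:lower:]])(?=[[:upper:][:digit:](])"
names_vec <- c("o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol",
               "ethylAmine")

pieces <- strsplit(names_vec, pat, perl = TRUE)

print(lengths(pieces))  # 3 2: number of parts per input string
print(pieces[[2]])      # "ethyl" "Amine"
all_parts <- unlist(pieces)  # flatten when the grouping is not needed
```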
Try this:
name="o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"
print(strsplit(gsub("([a-z])(\\d)","\\1#\\2",
gsub("([a-z])([A-Z])","\\1#\\2",name)),"#")[[1]])
It assumes a non-cap letter followed by a digit is a split as well as a non-cap followed by a cap.
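The two nested gsub calls can be folded into one, and the "(" case covered as well, by using a lookahead. This is a sketch that assumes "#" never occurs in the names:

```r
name <- "o-n-Butylhydroxylamine1-MethylpropylhydroxylamineAmino-2-butanol"

# Put a marker after each lowercase letter that is followed by
# an uppercase letter, a digit, or an opening bracket
marked <- gsub("([a-z])(?=[A-Z0-9(])", "\\1#", name, perl = TRUE)
parts <- strsplit(marked, "#", fixed = TRUE)[[1]]

print(parts)
# "o-n-Butylhydroxylamine" "1-Methylpropylhydroxylamine" "Amino-2-butanol"
```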