R split a character string on the second underscore - regex

I have character strings with two underscores, like these:
c54254_g4545_i5454
c434_g4_i455
c5454_g544_i3
.
.
etc
I need to split these strings at the second underscore, and I am afraid I have no clue how to do that in R (or any other tool, for that matter). I'd be very happy if anyone could sort me out here.
Thank you
SM

One way would be to replace the second underscore with another delimiter (e.g. a space) using sub and then split on that delimiter.
Using sub, we match one or more characters that are not a _ from the beginning (^) of the string (^[^_]+), followed by the first underscore (_), followed by one or more characters that are not a _ ([^_]+). We capture all of that as the first group by placing it inside parentheses ((....)). We then match the second _, followed by the rest of the string, which goes into the second capture group ((.*)$). In the replacement, we separate the first (\\1) and second (\\2) groups with a space.
strsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', v1), ' ')
#[[1]]
#[1] "c54254_g4545" "i5454"
#[[2]]
#[1] "c434_g4" "i455"
#[[3]]
#[1] "c5454_g544" "i3"
data
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')
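If the strings might themselves contain spaces, the temporary delimiter can be avoided. The following is a minimal sketch, assuming the same v1 and the same pattern as above; it uses base R's regexec and regmatches to return the two capture groups directly, and is an alternative rather than the approach taken in this answer.

```r
v1 <- c('c54254_g4545_i5454', 'c434_g4_i455', 'c5454_g544_i3')

# regexec() records the full match plus each capture group;
# regmatches() extracts them as one character vector per input string.
m <- regmatches(v1, regexec('^([^_]+_[^_]+)_(.*)$', v1))

# Drop the first element (the full match) to keep only the two groups.
parts <- lapply(m, function(x) x[-1])
parts[[1]]
#[1] "c54254_g4545" "i5454"
```

A string that does not match the pattern yields an empty character vector in parts, which makes unexpected inputs easy to spot.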

# x holds the strings from the question
strsplit(sub("(_)(?=[^_]+$)", " ", x, perl=TRUE), " ")
#[[1]]
#[1] "c54254_g4545" "i5454"
#
#[[2]]
#[1] "c434_g4" "i455"
#
#[[3]]
#[1] "c5454_g544" "i3"
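With the pattern "(_)(?=[^_]+$)", we replace the underscore that is followed only by non-underscore characters up to the end of the string, i.e. the last underscore, which is the second one when a string contains exactly two. The lookahead ((?=...)) is not part of the match, so the replacement needs no backreferences. That way we need at most one capture group, and even the parentheses around _ are optional.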
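The lookahead pattern and the anchored pattern from the other answer agree only when a string has exactly two underscores. A small sketch with made-up inputs (x here is hypothetical, not the data from the question) shows where they diverge:

```r
x <- c("a_b_c", "a_b_c_d")  # made-up inputs; the second has three underscores

# Lookahead version: replaces the LAST underscore.
last_us <- strsplit(sub("_(?=[^_]+$)", " ", x, perl = TRUE), " ")

# Anchored version: replaces the SECOND underscore.
second_us <- strsplit(sub("^([^_]+_[^_]+)_(.*)$", "\\1 \\2", x), " ")

last_us[[2]]
#[1] "a_b_c" "d"
second_us[[2]]
#[1] "a_b" "c_d"
```

For the strings in the question both give the same result, so the choice only matters if the input format can vary.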

I did this. However, although it works there may be a 'better' way?
str = 'c110478_g1_i1'
m = strsplit(str, '_')
f <- paste(m[[1]][1],m[[1]][2],sep='_')
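The snippet above rebuilds only the part before the second underscore and handles one string at a time. A possible extension, sketched here with hypothetical variable names (v, pieces, res), keeps both parts and works on a whole vector:

```r
v <- c('c110478_g1_i1', 'c54254_g4545_i5454')

# Split on every underscore (fixed = TRUE treats "_" literally, no regex).
pieces <- strsplit(v, '_', fixed = TRUE)

# Re-join the first two pieces; whatever follows becomes the second part.
res <- lapply(pieces, function(p) {
  c(paste(p[1:2], collapse = '_'), paste(p[-(1:2)], collapse = '_'))
})
res[[1]]
#[1] "c110478_g1" "i1"
```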

Related

How to match a string and white space in R

I have a dataframe with columns having values like:
"Average 18.24" "Error 23.34". My objective is to remove the text and the following space from these values in R. Can anybody help me with a regex pattern to do this?
I can successfully remove the letters using [A-Z], but I am not able to combine that with the white space: [A-Z][[:space:]] gives no luck.
Your help is appreciated.
We can use sub. Use the pattern \\D+ to match the first run of non-digit characters and '' as the replacement to remove it.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more spaces and replace with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")
You can use a quantifier, add a-z to the character class, and anchor the pattern at the start of the string with ^.
Either of these patterns works:
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
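Since the question mentions a data frame, here is a minimal sketch of applying the same pattern to a column and converting the result to numbers. The data frame and its column names (label, value) are assumptions made for illustration.

```r
# Hypothetical data frame; the column name 'label' is an assumption.
df <- data.frame(label = c("Average 18.24", "Error 23.34"),
                 stringsAsFactors = FALSE)

# Remove the leading word and the whitespace after it, then convert.
df$value <- as.numeric(sub("^\\S+\\s+", "", df$label))
df$value
#[1] 18.24 23.34
```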

Remove part of column name

I have a df with column names of a.b.c.v1, d.e.f.v1, h.j.k.v1, and would like to remove v1 from all the column names of df.
I suppose I should use gsub but my trials with that were not successful.
We can use sub to remove the .v1 from the end of the string. (If we only need to remove 'v1', just remove the \\. from the pattern, but I think a . at the end of a column name may not look that good.) Here, we match the dot (\\.) followed by one or more characters that are not a dot ([^.]+) until the end of the string ($) and replace the match with "".
colnames(df) <- sub('\\.[^.]+$', '', colnames(df))
colnames(df)
#[1] "a.b.c" "d.e.f" "h.j.k"
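The pattern above strips whatever follows the last dot. If only a literal .v1 suffix should go, a stricter pattern can be used. This is a sketch with a made-up data frame that has the column names from the question:

```r
# Hypothetical data frame with the column names from the question.
df <- data.frame(a.b.c.v1 = 1:2, d.e.f.v1 = 3:4, h.j.k.v1 = 5:6)

# Remove only a literal ".v1" suffix; names without it are left alone.
colnames(df) <- sub('\\.v1$', '', colnames(df))
colnames(df)
#[1] "a.b.c" "d.e.f" "h.j.k"
```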

how to replace a single/double character in a string

I want to replace every single-character word in my string with a blank. My idea is that there should be a space before and after the single character, so I have put spaces before and after the character class, but that doesn't seem to work. I would also like to replace words with more than one character, e.g. all words of length 2: how would the code change?
str="I have a cat of white color"
str=gsub("([[:space:]][[a-z]][[:space:]])", "", str)
I want to replace all the single character in my string with a blank. My idea is that there should be a space before and after the single character.
That idea is not quite right: a word is not always surrounded by spaces. What if the word is at the beginning of the string? Or at the end? Or is followed by punctuation?
Use \b word boundary:
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
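To make the three positions concrete, here is a small sketch on a made-up sentence. It compares a pattern that requires a space on both sides with one that uses word boundaries; the variable names are hypothetical.

```r
input <- "a cat, I saw b."  # made-up sentence

# Requiring a space on both sides finds only the "I" in the middle
# (and swallows the spaces around it).
spaced <- gsub("(?i) [a-z] ", "_", input, perl = TRUE)
spaced
#[1] "a cat,_saw b."

# Word boundaries also catch "a" at the start and "b" before the period.
bounded <- gsub("(?i)\\b[a-z]\\b", "_", input, perl = TRUE)
bounded
#[1] "_ cat, _ saw _."
```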
NOTE that in R, when you use gsub with word boundaries, it is best to use the PCRE regex engine (pass perl=TRUE):
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
So, to match all 1-letter words, you need to use
gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=TRUE) ## To replace 1 ASCII letter words
Note that (?i) is a case-insensitive modifier (making [a-z] match both a and A).
To match 2-letter words instead:
gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=TRUE) ## To replace 2 ASCII letter words
Here, we are using a limiting quantifier, {n} or {min,max}, to specify how many times the preceding subpattern must be repeated; {2} means exactly twice.
Demo:
> input = "I am a football fan"
> gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=TRUE) ## To replace 1 ASCII letter words
[1] "REPLACEMENT am REPLACEMENT football fan"
> gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=TRUE) ## To replace 2 ASCII letter words
[1] "I REPLACEMENT a football fan"
You need to use a regex quantifier, e.g. [a-z]{2}, which matches exactly two consecutive letters from a to z. The regex pattern you want is something along the lines of this:
\\s[a-z]{2}\\s
You can build this regex dynamically in R from the desired number of characters. Note that the pattern also consumes the whitespace on both sides of the word, as the output shows. Here is a code snippet which demonstrates this:
str <- "I have a cat of white color"
nchars <- 2
# named 'pattern' rather than 'exp' to avoid masking base::exp
pattern <- paste0("\\s[a-z]{", nchars, "}\\s")
gsub(pattern, "", str)
#[1] "I have a catwhite color"
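The output above shows that matching \\s on both sides also deletes the spaces, so "cat" and "white" run together. One way around this is to build the pattern from word boundaries instead. The helper name replace_words_of_length below is hypothetical, and the sketch assumes words made of ASCII letters only.

```r
# Hypothetical helper: replace every ASCII-letter word of exactly n characters.
replace_words_of_length <- function(x, n, replacement = "") {
  pattern <- paste0("(?i)\\b[a-z]{", n, "}\\b")
  gsub(pattern, replacement, x, perl = TRUE)
}

two <- replace_words_of_length("I have a cat of white color", 2, "XX")
two
#[1] "I have a cat XX white color"

one <- replace_words_of_length("I have a cat of white color", 1, "X")
one
#[1] "X have X cat of white color"
```

With an empty replacement the surrounding spaces stay in place, so a removed word leaves a double space behind.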

extract string after first occurrence of pattern AND before another pattern

I have the following string:
strings <- c("David, FC; Haramey, S; Devan, IA",
"Colin, Matthew J.; Haramey, S",
"Colin, Matthew")
If I want the last initials/given name for all strings, I can use the following:
sub(".*, ", "", strings)
[1] "IA" "S" "Matthew"
This removes everything before the last ", "
However, I am stuck on how to get the first initials/given name. I know I have to remove everything before the first ", ", but then I also have to remove everything after any spaces or semicolons, if there are any.
To be clear the output I want is:
c("FC", "Matthew", "Matthew")
Any pointers would be great.
By fiddling, I can get the first surnames with gsub( " .*$", "", strings )
You can use
> gsub("^[^\\s,]+,\\s+([^;.\\s]+).*", "\\1", strings, perl=TRUE)
[1] "FC" "Matthew" "Matthew"
Explanation:
^ - start of string
[^\\s,]+ - 1 or more characters other than whitespace or ,
, - a literal comma
\\s+ - 1 or more whitespace
([^;.\\s]+) - Group 1 matching 1 or more characters other than ;, . or whitespace
.* - zero or more characters other than a newline
If you want to use a POSIX-like expression, replace \\s inside the character classes (inside [...]) with [:blank:] (or [:space:]):
gsub( "^[^[:blank:],]+,\\s+([^;.[:blank:]]+).*", "\\1", strings)
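Instead of deleting everything around the wanted text, the first given name can also be extracted directly. This is a sketch of an alternative that uses base R's regexpr and regmatches with a PCRE lookbehind; it is not part of the answer above, and the name first_given is hypothetical.

```r
strings <- c("David, FC; Haramey, S; Devan, IA",
             "Colin, Matthew J.; Haramey, S",
             "Colin, Matthew")

# (?<=, ) is a lookbehind: the match must be preceded by ", " but does not
# include it. regexpr() reports only the first match in each string.
first_given <- regmatches(strings,
                          regexpr("(?<=, )[^;.\\s]+", strings, perl = TRUE))
first_given
#[1] "FC"      "Matthew" "Matthew"
```

One caveat: regmatches drops strings with no match, so the result can be shorter than the input if some strings contain no ", ".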

Replace words starting with particular character in R

I want to replace all words that begin with a given character with a different word. I tried gsub and str_replace_all, but with little success. In this example I want to replace all words starting with R with MM, but gsub replaces only the first one:
gsub("^R*\\w+", "MM", "Red, Rome, Ralf")
# [1] "MM, Rome, Ralf"
Thanks in advance
You must either replace the string start anchor (^) with a word boundary (\\b) or work with a vector of words:
gsub("\\bR\\w+", "MM", "Red, Rome, Ralf")
#[1] "MM, MM, MM"
gsub("^R\\w+", "MM", c("Red", "Rome", "Ralf"))
#[1] "MM" "MM" "MM"
Also, you probably want "R" instead of "R*", since the latter can match 0 or more instances of "R". The regexes above match only words with 2 or more characters, the first of which must be "R". The last regex only matches words at the beginning of the string.
Thanks @flodel for pointing out the missing word boundary "\b" in the first regex!
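The two-character minimum and the role of the word boundary can be checked on a made-up input. In this sketch, x and the result names are hypothetical, and perl = TRUE is passed because R's documentation recommends it for word-boundary matches.

```r
x <- "R, Red, bRed"  # made-up input

# \\w+ needs at least one character after the R, so the lone "R" survives.
plus <- gsub("\\bR\\w+", "MM", x, perl = TRUE)
plus
#[1] "R, MM, bRed"

# \\w* also accepts a bare "R"; \\b still protects the R inside "bRed".
star <- gsub("\\bR\\w*", "MM", x, perl = TRUE)
star
#[1] "MM, MM, bRed"
```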