extract string after first occurrence of pattern AND before another pattern - regex

I have the following string:
strings <- c("David, FC; Haramey, S; Devan, IA",
"Colin, Matthew J.; Haramey, S",
"Colin, Matthew")
If I want the last initials/given name for all strings, I can use the following:
sub(".*, ", "", strings)
[1] "IA" "S" "Matthew"
This removes everything before the last ", "
However, I am stuck on how to get the first initials/given name. I know I have to remove everything before the first ", ", but then I also have to remove everything after any space or semicolon, if present.
To be clear the output I want is:
c("FC", "Matthew", "Matthew")
Any pointers would be great.
With some fiddling I can get the first surnames: gsub( " .*$", "", strings )

You can use
> gsub( "^[^\\s,]+,\\s+([^;.\\s]+).*", "\\1", strings, perl=TRUE)
[1] "FC" "Matthew" "Matthew"
Explanation:
^ - start of string
[^\\s,]+ - 1 or more characters other than whitespace or ,
, - a literal comma
\\s+ - 1 or more whitespace
([^;.\\s]+) - Group 1 matching 1 or more characters other than ;, . or whitespace
.* - zero or more any character other than a newline
If you want to use a POSIX-like expression, replace \\s inside the character classes (inside [...]) with [:blank:] (or [:space:]):
gsub( "^[^[:blank:],]+,\\s+([^;.[:blank:]]+).*", "\\1", strings)
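The two-step idea from the question (strip everything up to the first ", ", then strip everything from the first separator onwards) can also be written as two sub calls. This is only a sketch of that alternative, not the approach of the answer above; the intermediate name after_first is made up for illustration.

```r
strings <- c("David, FC; Haramey, S; Devan, IA",
             "Colin, Matthew J.; Haramey, S",
             "Colin, Matthew")

# Step 1: drop everything up to and including the FIRST ", "
# (the lazy .*? stops at the earliest possible match)
after_first <- sub("^.*?, ", "", strings, perl = TRUE)

# Step 2: drop everything from the first space, semicolon or period onwards
first_given <- sub("[ ;.].*$", "", after_first)

first_given
# [1] "FC"      "Matthew" "Matthew"
```

The single-regex version does the same work in one pass; the two-step form may be easier to debug because each intermediate result can be inspected.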

Related

How to match a string and white space in R

I have a dataframe with columns having values like:
"Average 18.24" "Error 23.34". My objective is to remove the text and the following space from these values in R. Can anybody help me with a regex pattern to do this?
I am able to match the letters using [A-Z], but I cannot get the white space to match as well: [A-Z][[:space:]] did not work.
Your help is appreciated.
We can use sub. Use the pattern \\D+ to match all non-numeric characters and then use '' in the replacement to remove those.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more spaces and replace with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")
You can use a quantifier and add a-z to the pattern (and the ^ anchor)
You can use
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
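Since the question mentions data frame columns, here is a minimal sketch of applying the same pattern to every column. The data frame df and its column names a and b are made up for illustration, and the as.numeric conversion is an assumption about what is wanted next.

```r
# Hypothetical data frame with "label number" strings in every column
df <- data.frame(a = c("Average 18.24", "Error 23.34"),
                 b = c("Average 7.5", "Error 0.25"),
                 stringsAsFactors = FALSE)

# Strip the leading letters and whitespace from each column, then convert to numeric.
# df[] <- ... keeps the data.frame structure while replacing the columns.
df[] <- lapply(df, function(col) {
  as.numeric(sub("^[A-Za-z]+[[:space:]]+", "", col))
})

str(df)
```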

Rearrange a character string

I have a character vector where some entries have a certain pattern at the end. I want to remove this pattern from the end and put it in front of the rest.
Example:
#My initial character vector
names <- c("sdadohf abc", "fsdgodhgf abc", "afhk xyz")
> names
[1] "sdadohf abc" "fsdgodhgf abc" "afhk xyz"
#What I want is to move "abc" to the front
> names
[1] "abc sdadohf" "abc fsdgodhgf" "afhk xyz"
Is there an easy way to achieve this or do I have to write my own function?
First let's add one more string to your vector, one with multiple spaces between the text.
names <- c("sdadohf abc", "fsdgodhgf abc", "afhk xyz", "aksle   abc")
You could use capturing groups in sub().
sub("(.*?)\\s+(abc)$", "\\2 \\1", names)
# [1] "abc sdadohf" "abc fsdgodhgf" "afhk xyz" "abc aksle"
Regex explanation courtesy of regex101:
(.*?) 1st Capturing group - matches any character (except newline) between zero and unlimited times, as few times as possible, expanding as needed
\\s+ matches any white space character [\r\n\t\f ] between one and unlimited times, as many times as possible, giving back as needed
(abc) 2nd Capturing group - abc matches the characters abc literally, and $ asserts position at end of the string
When we swap the groups in "\\2 \\1", we bring the second capturing group abc to the beginning of the string.
Thanks to @Jota and @docendodiscimus for helping to improve my original regular expression.
Here is a split method. We split the 'names' by one or more space (\\s+) followed by 'abc' ((?=abc)), loop through the list with vapply, reverse (rev) the list elements and paste it together.
vapply(strsplit(names, "\\s+(?=abc)", perl=TRUE), function(x)
paste(rev(x), collapse=" "), character(1))
#[1] "abc sdadohf" "abc fsdgodhgf" "afhk xyz" "abc aksle"
data
names <- c("sdadohf abc", "fsdgodhgf abc", "afhk xyz", "aksle abc")
Use this
sub("(.*) \\b(abc)$", "\\2 \\1", names)
.* is a greedy match. It will match as much as it can before finding the string ending with abc.
.* is in first captured group(\\1)
abc is in second captured group(\\2)
We can just interchange their position using \\2 \\1 to find our resultant string
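The difference between the greedy pattern with a single literal space and the lazy pattern with \\s+ only shows up when there are several spaces before abc. A small sketch, assuming an input with three spaces (the exact number is my assumption) and using perl = TRUE so both patterns run under the same engine:

```r
x <- "aksle   abc"  # three spaces before "abc"

# Greedy (.*) followed by ONE literal space: the extra spaces stay in group 1
greedy <- sub("(.*) \\b(abc)$", "\\2 \\1", x, perl = TRUE)

# Lazy (.*?) followed by \\s+: all the spaces are consumed outside group 1
lazy <- sub("(.*?)\\s+(abc)$", "\\2 \\1", x, perl = TRUE)

greedy  # "abc aksle  "  (two trailing spaces)
lazy    # "abc aksle"
```

If trailing whitespace matters, the \\s+ variant is the safer choice; otherwise trimws() on the result would also clean it up.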

Remove letters matching pattern before and after the required string

I have a vector with the following elements:
myvec<- c("output.chr10.recalibrated", "output.chr11.recalibrated",
"output.chrY.recalibrated")
I want to selectively extract the value after chr and before .recalibrated and get the result.
Result:
10, 11, Y
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
The pattern means:
chr - match a sequence of literal characters chr
(.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated - .recalibrated literal character sequence.
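If stringr is not available, roughly the same capture-group extraction can be sketched in base R with regexec and regmatches. This is an alternative I am adding for illustration, not part of the answer above; the variable name m is arbitrary.

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# regexec returns the positions of the full match and of each capture group;
# regmatches turns them into a list of character vectors (full match, group 1)
m <- regmatches(myvec, regexec("chr(.*?)\\.recalibrated", myvec, perl = TRUE))

# Pick the first capture group (element 2) from every list entry
vapply(m, function(x) x[2], character(1))
# [1] "10" "11" "Y"
```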
Both answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here is my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
What the regex does:
.*[.]chr matches as much as possible up to a literal '.chr'
([^.]*) captures everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, which requires at least one digit to be present)
[.].* matches a literal dot and the rest of the line
I prefer escaping dots with a character class ([.]) over the backslash escape (\\.), as it's usually easier to read when you come back to the regex. That's my opinion and not covered by any best practice I know of.
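To make the difference concrete, here is a small sketch that runs the lazy pattern from the first answer and the [^.]* pattern on the harder input mentioned above. perl = TRUE on the first call is my addition, so that the lazy quantifiers are handled by PCRE.

```r
hard <- "whatever.chr10.whateverelse.recalibrated"

# Lazy capture up to ".recalibrated": also swallows the extra ".whateverelse" segment
lazy_capture <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", hard, perl = TRUE)

# Capture only non-dot characters after ".chr": stops at the next dot
nodot_capture <- sub(".*[.]chr([^.]*)[.].*", "\\1", hard)

lazy_capture   # "10.whateverelse"
nodot_capture  # "10"
```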
We can use str_extract to do this. We match one or more characters (.*) that follow 'chr' ((?<=chr)) and come before the .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match the characters until chr or (|) that starts from .recalibrated to the end ($) of the string and replace it with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
This looks like an XY problem. Why extract? If this is needed in further analysis steps, we could for example do this instead:
for(chrN in c(1:22, "X", "Y")) {
myVar <- paste0("output.chr", chrN, ".recalibrated")
#do some fun stuff with myVar
print(myVar)
}

how to replace a single/double character in a string

I want to replace every single-character word in my string with a blank. My idea is that there should be a space before and after the single character, so I have put spaces before and after the character in the pattern, but that doesn't seem to work. I would also like to replace words of more than one character, e.g. all words of length 2. How would the code change?
str="I have a cat of white color"
str=gsub("([[:space:]][[a-z]][[:space:]])", "", str)
I want to replace all the single character in my string with a blank. My idea is that there should be a space before and after the single character.
The idea is not correct: a word is not always surrounded by spaces. What if the word is at the beginning of the string? Or at the end? Or is followed by punctuation?
Use \b word boundary:
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
NOTE that in R, when you use gsub, it is best to use it with the PCRE regex (pass perl=T):
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
So, to match all 1-letter words, you need to use
gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
Note that (?i) is a case-insensitive modifier (making [a-z] match both a and A).
Now, you need to match 2 letter words:
gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
Here, we are using a limiting quantifier, {n} (or, more generally, {min,max}), to specify how many times the pattern quantified with this construct must be repeated.
See IDEONE demo:
> input = "I am a football fan"
> gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
[1] "REPLACEMENT am REPLACEMENT football fan"
> gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
[1] "I REPLACEMENT a football fan"
You need to use a regex quantifier, e.g. [a-z]{2}, which matches exactly two consecutive letters from a to z. The regex pattern you want is something along the lines of this:
\\s[a-z]{2}\\s
You can build this regex dynamically in R using an input number of characters. Here is a code snippet which demonstrates this:
str <- "I have a cat of white color"
nchars <- 2
exp <- paste0("\\s[a-z]{", nchars, "}\\s")
> gsub(exp, "", str)
[1] "I have a catwhite color"
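The dynamic construction can be combined with the word-boundary pattern from the other answer, so that words at the start or end of the string are caught too. A minimal sketch; the helper name replace_n_letter_words and its arguments are hypothetical, not from any package:

```r
# Hypothetical helper: replace every ASCII word of exactly n letters
replace_n_letter_words <- function(x, n, replacement = "") {
  # Build e.g. "(?i)\\b[a-z]{2}\\b" from n; \\b needs perl = TRUE to work reliably
  pattern <- sprintf("(?i)\\b[a-z]{%d}\\b", n)
  gsub(pattern, replacement, x, perl = TRUE)
}

s <- "I have a cat of white color"
replace_n_letter_words(s, 1, "_")  # "_ have _ cat of white color"
replace_n_letter_words(s, 2, "_")  # "I have a cat _ white color"
```

Unlike the \\s-based pattern, this leaves the surrounding spaces untouched, so neighbouring words are not glued together.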

Regex leading space/add trailing space before/to punctuation

To better clean my forum message corpus, I would like to remove the leading spaces before punctuation and add one after if needed, using two regular expressions. The latter was no problem ((?<=[.,!?()])(?! )), but I have some problems with the former.
I used this expression: \s([?.!,;:"](?:\s|$))
But it's by far not flexible enough:
It matches even if there's already a space(or more) before the punctuation character
It doesn't match if there's not a space after the punctuation character
It doesn't match any unlisted punctuation character (but I guess I can use [:punct:] for that, at the end of the day)
Finally, both match decimal points (while they should not)
How can I eventually rewrite the expression to meet my needs?
Example Strings and expected output
This is the end .Hello world! # This is the end. Hello world! (remove the leading, add the trailing)
This is the end, Hello world! # This is the end, Hello world! (ok!)
This is the end . Hello world! # This is the end. Hello world! (remove the leading, ok the trailing)
This is a .15mm tube # This is a .15 mm tube (ok since it's a decimal point)
Use \p{P} to match any punctuation character. Use \h* instead of \s* because \s would also match newline characters.
(?<!\d)\h*(\p{P}+)\h*(?!\d)
Replace the matched strings by \1<space>
DEMO
> x <- c('This is the end .Stuff', 'This is the end, Stuff', 'This is the end . Stuff', 'This is a .15mm tube')
> gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl=T)
[1] "This is the end. Stuff" "This is the end, Stuff" "This is the end. Stuff"
[4] "This is a .15mm tube"
Here's an expression that detects the substrings that need to be replaced:
\s*\.\s*(?!\d)
You need to replace these by: . (a dot and a space)
Here's a demo link of how this works: http://regex101.com/r/zB2bY3/1
Explanation of the regex:
\s* - matches whitespace, any number of chars (0 - unbounded)
\. - matches a dot
\s* - same as above
(?!\d) - negative lookahead. It means that the string, in order to be matched, must not be followed by a digit (this handles your last test case).
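For comparison with the R demo in the other answer, the dot-only expression can be tried in R as follows. This is a sketch using the question's example strings; perl = TRUE is needed for the lookahead, and only dots are handled, not other punctuation.

```r
x <- c("This is the end .Hello world!",
       "This is the end, Hello world!",
       "This is the end . Hello world!",
       "This is a .15mm tube")

# Replace optional whitespace + dot + optional whitespace (not followed by a digit)
# with a dot and a single space
cleaned <- gsub("\\s*\\.\\s*(?!\\d)", ". ", x, perl = TRUE)

cleaned
# [1] "This is the end. Hello world!" "This is the end, Hello world!"
# [3] "This is the end. Hello world!" "This is a .15mm tube"
```

Note that a dot at the very end of a string would also get a trailing space appended, since nothing in the pattern excludes the end of the string.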