I have strings of the following flavor:
Random Inc
A Non-Random Inc
I would like to remove the word Inc from all those strings where more than one word precedes it. The result for the two examples above would be:
Random Inc
A Non-Random
What is the right regex to plug into gsub for this? In particular, how does one specify complete words in regex? I thought it would be \w but this is a word character which does not seem correct.
\w matches a word character, but in this case it seems you need to account for the hyphen and use a quantifier.
x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')
sub('[\\w-]+ [\\w-]+\\K *Inc', '', x, perl=TRUE)
# [1] "Random Inc" "A Non-Random" "Another Inc" "A Random other"
First we match one or more word characters or hyphens, followed by a space, followed again by one or more word characters or hyphens. The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included in it. Then we match a space "zero or more" times followed by the word Inc. Because \K drops the two preceding words from the reported match, only the spaces and Inc are replaced, so an empty replacement is enough.
You can use a regex like this:
([-\w]+\s+[-\w]+)\s+Inc
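In R, this pattern could be plugged into sub by putting the captured two words back with \\1. The following is only a sketch: it reuses the x vector from the answer above and assumes perl=TRUE so that \w is honoured inside the bracket expression.

```r
x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')

# Group 1 holds the two words right before Inc; the replacement keeps only them,
# which drops the trailing whitespace and Inc.
res <- sub('([-\\w]+\\s+[-\\w]+)\\s+Inc', '\\1', x, perl = TRUE)
print(res)
```

Strings with only one word before Inc do not match at all, so sub returns them unchanged.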
I think you mean one or more non-space characters as a complete word. If yes, then you could use \S+.
> x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')
> sub("^\\S+(?:\\s+\\S+)?$(*SKIP)(*F)|\\s+Inc\\b", "", x, perl=T)
[1] "Random Inc" "A Non-Random" "Another Inc" "A Random other"
^\\S+(?:\\s+\\S+)?$ Matches the line which has exactly one or two words.
(*SKIP)(*F) Causes the match to Fail.
| OR (ie, consider only the remaining part of the string)
\\s+Inc\\b Matches Inc together with the one or more space characters preceding it.
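To see which strings the first alternative protects, it can be tested on its own with grepl. This is a small sketch reusing the same x; the variable name short is just illustrative.

```r
x <- c('Random Inc', 'A Non-Random Inc', 'Another Inc', 'A Random other Inc')

# TRUE where the whole string consists of one or two words,
# i.e. where (*SKIP)(*F) discards the match and Inc is kept.
short <- grepl("^\\S+(?:\\s+\\S+)?$", x, perl = TRUE)
print(short)
```

Only the strings flagged FALSE here reach the second alternative and lose their trailing Inc.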
I have a vector with the following elements:
myvec<- c("output.chr10.recalibrated", "output.chr11.recalibrated",
"output.chrY.recalibrated")
I want to selectively extract the value after chr and before .recalibrated and get the result.
Result:
10, 11, Y
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
The pattern means:
chr - match a sequence of literal characters chr
(.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated - .recalibrated literal character sequence.
Both answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here's my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
What the regex does is:
.*[.]chr match as much as possible until finding '.chr' literally
([^.]*) capture everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, which requires at least one digit to be present)
[.].* match the rest of the line after a literal dot
I prefer escaping dots with a character class ([.]) over the backslash escape (\\.), as it's usually easier to read when you come back to the regex. That's my opinion and not covered by any best practice I know of.
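A quick way to see the difference is to run the patterns on the longer input mentioned above. This is only a sketch; the variable names are illustrative, and perl=TRUE is passed where lazy quantifiers or \d are used to keep the behaviour unambiguous.

```r
tricky <- "whatever.chr10.whateverelse.recalibrated"

# Lazy capture up to the first ".recalibrated": picks up the extra segment too.
lazy <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", tricky, perl = TRUE)

# Capture stops at the first dot after chr.
strict <- sub(".*[.]chr([^.]*)[.].*", "\\1", tricky)

# Variant that only accepts digits after chr.
digits <- sub(".*[.]chr(\\d+)[.].*", "\\1", tricky, perl = TRUE)

print(c(lazy = lazy, strict = strict, digits = digits))
```

Note that the digits-only variant would leave a value such as chrY unmatched, in which case sub returns the input unchanged.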
We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and come before .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match either the characters up to and including .chr, or (|) the part that starts at .recalibrated and runs to the end ($) of the string, and replace both with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
This looks like an XY problem. Why extract at all? If the value is needed in further analysis steps, we could, for example, do this instead:
for(chrN in c(1:22, "X", "Y")) {
myVar <- paste0("output.chr", chrN, ".recalibrated")
#do some fun stuff with myVar
print(myVar)
}
I want to replace every single-character word in my string with a blank. My idea is that there should be a space before and after the single character, so I put spaces before and after the character in the pattern, but that doesn't seem to work. I would also like to replace words with more than one character, i.e. if I want to replace all words of length 2 or so, how would the code change?
str="I have a cat of white color"
str=gsub("([[:space:]][[a-z]][[:space:]])", "", str)
I want to replace all the single character in my string with a blank. My idea is that there should be a space before and after the single character.
The idea is not correct: a word is not always surrounded by spaces. What if the word is at the beginning of the string? Or at the end? Or is followed by punctuation?
Use \b word boundary:
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
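The three cases can be checked on a small made-up string (the input below is purely illustrative) where one-letter words sit at the start, before punctuation, and at the end:

```r
input <- "a cat, b. c"  # made-up test string

# \b matches at the string start, before the ".", and at the string end,
# so all three one-letter words are removed; the letters inside "cat" are not.
res <- gsub("\\b[a-z]\\b", "", input, perl = TRUE)
print(res)
```

A pattern that requires a literal space on both sides would miss all three of these words.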
NOTE that in R, when you use gsub, it is best to use it with the PCRE regex (pass perl=T):
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
So, to match all 1-letter words, you need to use
gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
Note that (?i) is a case-insensitive modifier (so the pattern matches both a and A).
Now, to match 2-letter words:
gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
Here, we are using a limiting quantifier {n} (more generally {min,max}) to specify how many times the subpattern quantified with this construct must be repeated.
For example:
> input = "I am a football fan"
> gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
[1] "REPLACEMENT am REPLACEMENT football fan"
> gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
[1] "I REPLACEMENT a football fan"
You need to use a quantifier, e.g. [a-z]{2}, which matches exactly two consecutive letters from a to z. The regex pattern you want is something along the lines of this:
\\s[a-z]{2}\\s
You can build this regex dynamically in R using an input number of characters. Here is a code snippet which demonstrates this:
str <- "I have a cat of white color"
nchars <- 2
exp <- paste0("\\s[a-z]{", nchars, "}\\s")
> gsub(exp, "", str)
[1] "I have a catwhite color"
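The dynamic construction can be combined with the word-boundary approach from the other answer. Below is a hedged sketch; remove_n_letter_words is a hypothetical helper name, not something from either answer.

```r
# Hypothetical helper: remove all words made of exactly n ASCII letters.
remove_n_letter_words <- function(text, n) {
  # Build e.g. "(?i)\\b[a-z]{2}\\b" for n = 2.
  pattern <- sprintf("(?i)\\b[a-z]{%d}\\b", n)
  gsub(pattern, "", text, perl = TRUE)
}

res1 <- remove_n_letter_words("I have a cat of white color", 1)
res2 <- remove_n_letter_words("I have a cat of white color", 2)
print(res1)
print(res2)
```

Unlike the \\s-based pattern, this version leaves the surrounding spaces in place, so neighbouring words are not glued together, although doubled spaces may remain.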
I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated. I split the string at 7 as described, match any digits in the first 7 characters and bind it with the rest of the string.
Looking for a general answer, my current solution is to split the first 7 characters and just match all digits in this sub string.
Any help appreciated.
You can use the well-known SKIP-FAIL regex trick to match and discard the rest of the string beginning with the 8th character, so that only non-digit characters within the first 7 are removed. A lookbehind marks where that remainder starts:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)
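For comparison, the split-and-rejoin idea from the question can be written fairly compactly in base R, without stringr. This is just a sketch; head7 and tail_rest are illustrative names.

```r
s <- "A12B345CD678"

# Strip non-digits from the first 7 characters only.
head7 <- gsub("\\D", "", substr(s, 1, 7))

# Everything from the 8th character on is left untouched.
tail_rest <- substring(s, 8)

res <- paste0(head7, tail_rest)
print(res)
```

All three functions are vectorised, so the same lines should also work when s is a character vector.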
I need to find a regexp that allows me to find strings that contain all the required digits, but each only once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
The pattern 123 only once and in any order and in any position of the string
I tried a lot of things, and this seemed to be the closest, although it's still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What i exactly need is to find all 3 digit numbers containing 1 2 AND 3 digits
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If the same digit must not appear more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it checks and returns true if there is no match, otherwise it returns false. Thus, it does not "consume" characters, it does not "match" the pattern inside the look-ahead, and the regex engine stays at the same location in the input string. In this regex, it is the beginning of string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text inside group 1 with \\1. Thus, if there is 121, there will be no match since the look-ahead will return false as it will find two 1s.
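The look-ahead can also be tried in isolation to see which strings it rejects. This is a small sketch with made-up candidates:

```r
candidates <- c("121", "123", "3123")  # made-up inputs

# TRUE only where no character occurs twice anywhere in the string.
no_repeat <- grepl("^(?!.*(.).*\\1)", candidates, perl = TRUE)
print(no_repeat)
```

On its own the look-ahead says nothing about length or allowed digits; that part is handled by [123]{3}$ in the full pattern.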
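You can also spell out the three positions explicitly, so that each digit must differ from the ones before it. Note the [123] classes and the $ anchor; with \\d and no end anchor, longer strings such as 2313 would also match:
grep('^([123])((?!\\1)[123])(?!\\1|\\2)[123]$', a, value=TRUE, perl=T)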
Consider the following vector x
x <- c("000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00")
Using a single regular expression, I'd like to keep only those elements of x that contain more than three letters. Here are the rules:
The letters are not always consecutive (this is the main issue)
The string elements of x can be of any number of characters
There will be nothing in the string except digits and lower-case letters
The simple way I thought of would be to remove everything that is not a letter and then take the number of characters, something like the following.
x[nchar(gsub("[0-9]+", "", x)) > 3]
# [1] "abcd00ab" "abcdefg" "000s00r00g00t00"
I know that there are statements like [a-z]{4,} that find four or more consecutive lower-case letters. But what if individual letters are scattered about the string? How can I keep a "running count" of letters such that when it passes three, it becomes a match? Right now all I can think of is to write [a-z]+ a bunch of times, but this can get ugly if I want to match say, five or more letters.
This gets me there, but you can see how this could be ugly for longer strings.
grep("[a-z]+.*[a-z]+.*[a-z]+.*[a-z]+.*", x)
# [1] 2 3 4
Is there a way to do that with a better regular expression?
Try this where \\D matches a non-digit, .* matches a string of 0 or more characters and (...){4} says to match four times, i.e. more than 3.
grep("(\\D.*){4}", x, value = TRUE)
This will match if there are 4 or any greater number of non-digits. Just replace 4 with 6 if you need more than 5. If it's important to have the number 3 in the regexp then try this pattern (\\D.*){3}\\D instead.
There is a repetition operator you can use: {n} matches the previous token or group n times. To make matches more efficient, you should also be specific in what may be matched between letters (in your case only digits, not "any" character (which the dot . matches)):
^(?:[0-9]*[a-z]){4}[0-9a-z]*$
matches all strings that contain at least 4 (i.e. more than three) lowercase letters.
Explanation:
^ # Start of string
(?: # Start of a (non-capturing) group:
[0-9]* # Match any number of digits
[a-z] # Match one lowercase ASCII letter
){4} # Repeat the group exactly four times
[0-9a-z]* # Then match any following digits/letters
$ # until the end of the string
In R:
grep("^(?:[0-9]*[a-z]){4}[0-9a-z]*$", x, perl=TRUE, value=TRUE)
gives you a character vector with all the elements that are matched by the regex.
The grep command below finds the elements which have four or more letters:
> grep("^(?:[^a-z]*[a-z]){4}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
OR
> grep("^(?:[^a-z]*[a-z]){3}[^a-z]*[a-z]", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg" "000s00r00g00t00"
To find the elements which have 5 or more letters,
> grep("^(?:[^a-z]*[a-z]){5}", x, perl=T, value=T)
[1] "abcd00ab" "abcdefg"
Explanation:
^ the beginning of the string
(?: group, but do not capture (4 times):
[^a-z]* any character except: 'a' to 'z' (0 or more times)
[a-z] any character of: 'a' to 'z'
){4} end of grouping
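Since only the repetition count changes between these patterns, it can be built from a parameter. The following is a sketch; has_n_letters is a hypothetical helper name.

```r
x <- c("000a000b000c", "abcd00ab", "abcdefg", "000s00r00g00t00")

# Hypothetical helper: keep the strings that contain at least n lowercase letters.
has_n_letters <- function(strings, n) {
  # Each repetition skips any non-letters and then consumes exactly one letter.
  pattern <- sprintf("^(?:[^a-z]*[a-z]){%d}", n)
  grep(pattern, strings, perl = TRUE, value = TRUE)
}

four  <- has_n_letters(x, 4)
seven <- has_n_letters(x, 7)
print(four)
print(seven)
```

For "more than three letters", as in the question, n would be 4.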