Remove any digit only in first N characters - regex

I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated: I split the string at position 7, extract the digits from the first 7 characters, and paste the result back onto the rest of the string.
I'm looking for a more general answer than this split-and-match approach.
Any help appreciated.

You can use the known SKIP-FAIL regex trick to match all the rest of the string beginning with the 8th character, and only match non-digit characters within the first 7 with a lookbehind:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.

The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))

You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)
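For a prefix length other than 7, the look-behind pattern can be built from the length. The following is only a minimal sketch: the helper names strip_prefix_nondigits and strip_prefix_split are hypothetical (they do not come from the answers above), and the input is assumed to contain no newlines. The second helper repeats the split-and-paste idea from the question as a cross-check.

```r
# Hypothetical helper: drop non-digits that occur within the first n characters.
strip_prefix_nondigits <- function(x, n = 7) {
  # (?<!.{n}) succeeds only when fewer than n characters precede the position
  pattern <- sprintf("(?<!.{%d})\\D", as.integer(n))
  gsub(pattern, "", x, perl = TRUE)
}

# Same idea without look-behind: clean the prefix, then re-attach the rest.
strip_prefix_split <- function(x, n = 7) {
  paste0(gsub("\\D", "", substr(x, 1, n)), substr(x, n + 1, nchar(x)))
}

s <- "A12B345CD678"
res_regex <- strip_prefix_nondigits(s, 7)
res_split <- strip_prefix_split(s, 7)
```

Both calls should agree on the example string; the split version may be easier to maintain, while the regex version keeps everything in one gsub call.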

Related

Ruby Regex - If the string is more than 10 characters, remove the first character if it is a "1"

Without using a gem, I just want to write a simple regex to remove the first character from a string if it's a 1 and the string has more than 10 characters. I never expect more than 11 characters; 11 should be the max. But if there are exactly 10 characters and the string begins with "1", I don't want to remove it.
str = "19097147835"
str&.remove(/\D/).sub(/^1\d{10}$/, "\1").to_i
Returns 0
I'm looking for it to return "9097147835"
You could use your pattern, but add a capture group around the 10 digits to use the group in the replacement.
\A1(\d{10})\z
For example
str = "19097147835"
puts str.gsub(/\D/, '').sub(/\A1(\d{10})\z/, '\1').to_i
Output
9097147835
Another option is to remove all the non-digits and then match the last 10 digits:
\A1\K\d{10}\z
\A Start of string
1\K Match 1 and forget what is matched so far
\d{10} Match 10 digits
\z End of string
str = "19097147835"
str.gsub(/\D/, '').match(/\A1\K\d{10}\z/) do |match|
puts match[0].to_i
end
Output
9097147835
You can use
str.gsub(/\D/, '').sub(/\A1(?=\d{10})/, '').to_i
See the Ruby demo and the regex demo.
The regex matches
\A - start of string
1 - a 1
(?=\d{10}) - immediately to the right of the current location, there must be 10 digits.
Non regex example:
str = str[1..] if (str.start_with?("1") and str.size > 10)
Regexes are powerful, but not easy to maintain.

Remove letters matching pattern before and after the required string

I have a vector with the following elements:
myvec<- c("output.chr10.recalibrated", "output.chr11.recalibrated",
"output.chrY.recalibrated")
I want to selectively extract the value after chr and before .recalibrated and get the result.
Result:
10, 11, Y
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
The pattern means:
chr - match a sequence of literal characters chr
(.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated - .recalibrated literal character sequence.
Both answers fail for slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here's my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
What the regex does:
.*[.]chr match as much as possible until finding '.chr' literally
([^.]*) capture everything that is not a dot after chr (could be replaced by \\d+ to capture only numeric values, requiring at least one digit to be present)
[.].* match the rest of the line after a literal dot
I prefer the character class escape of dots ([.]) over the backslash escape (\\.), as it's usually easier to read when you come back to the regex. That's my opinion and not covered by any best practice I know of.
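To make the difference concrete, here is a small sketch that runs the lazy pattern and the dot-delimited pattern on both kinds of input. The vector x is a made-up example, and perl = TRUE is added to the lazy pattern only so that the lazy quantifiers follow PCRE semantics; neither is part of the answers above.

```r
x <- c("output.chr10.recalibrated",
       "whatever.chr10.whateverelse.recalibrated")

# Lazy capture up to ".recalibrated": picks up the extra segment in the second input
lazy <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", x, perl = TRUE)

# Capture only up to the next dot after ".chr"
dotted <- sub(".*[.]chr([^.]*)[.].*", "\\1", x)
```

With these inputs, lazy keeps "10.whateverelse" for the second element, while dotted yields "10" for both.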
We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and precede .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match either everything up to and including .chr, or (|) everything from .recalibrated to the end ($) of the string, and replace it with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
This looks like an XY problem. Why extract? If the values are needed in further analysis steps, we could, for example, build the names instead:
for(chrN in c(1:22, "X", "Y")) {
myVar <- paste0("output.chr", chrN, ".recalibrated")
#do some fun stuff with myVar
print(myVar)
}

regexp - find numbers in a string in any order

I need a regexp that finds strings containing all the required digits, each exactly once.
For example:
a <- c("12","13","112","123","113","1123","23","212","223","213","2123","312","323","313","3123","1223","1213","12123","2313","23123","13123")
I want to get:
"123" "213" "312"
That is, the digits 1, 2 and 3, each only once, in any order and in any position of the string.
I tried a lot of things, and this seemed to be the closest, although it's still very far from what I want:
grep('[1:3][1:3][1:3]', a, value=TRUE)
[1] "113" "313" "2313" "13123"
What I need exactly is to find all 3-digit numbers containing the digits 1, 2 AND 3
Then you can safely use
grep('^[123]{3}$', a, value=TRUE)
##=> [1] "112" "123" "113" "212" "223" "213" "312" "323" "313"
The regex matches:
^ - start of string
[123]{3} - Exactly 3 characters that are either 1, or 2 or 3
$ - assert the position at the end of string.
Also, if you only need unique values, use unique.
If you must not allow the same digit more than once, you need a Perl-based regex:
grep('^(?!.*(.).*\\1)[123]{3}$', a, value=TRUE, perl=T)
## => [1] "123" "213" "312"
Note the double escaped back-reference. The (?!.*(.).*\\1) negative look-ahead will check if the string has no repeated symbols with the help of a capturing group (.) and a back-reference that forces the same captured text to appear in the string. If the same characters are found, there will be no match. See IDEONE demo.
The (?!.*(.).*\\1) is a negative look-ahead. It only asserts the absence of some pattern after the current regex engine position, i.e. it checks and returns true if there is no match, otherwise it returns false. Thus, it does not "consume" characters and it does not "match" the pattern inside the look-ahead; the regex engine stays at the same location in the input string. In this regex, that location is the beginning of the string (^). So, right at the beginning of the string, the regex engine starts looking for .* (any character but a newline, 0 or more repetitions), then captures 1 character (.) into group 1, again matches 0 or more characters with .*, and then tries to match the same text as in group 1 with \\1. Thus, if the string is 121, there will be no match, since the look-ahead will return false as it will find two 1s.
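As a sanity check on the look-ahead pattern, the same filter can be written without a regex by sorting each string's characters. This is just a sketch on a shortened, made-up subset of the input vector; the names by_regex, is_perm and by_sort are illustrative only.

```r
a <- c("12", "123", "113", "1123", "213", "312", "3123", "2313")

# Regex from the answer: exactly three characters from 1-3, none repeated
by_regex <- grep("^(?!.*(.).*\\1)[123]{3}$", a, value = TRUE, perl = TRUE)

# Non-regex cross-check: the sorted characters must be exactly 1, 2, 3
is_perm <- vapply(strsplit(a, ""),
                  function(ch) identical(sort(ch), c("1", "2", "3")),
                  logical(1))
by_sort <- a[is_perm]
```

Both approaches should select the same elements; the sorting version scales more naturally if the set of required digits grows.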
You can also use this, which checks each digit against the ones captured before it and anchors the match at both ends:
grep('^([123])((?!\\1)[123])(?!\\1|\\2)[123]$', a, value=TRUE, perl=TRUE)

Remove last occurrence of character

A question came across talkstats.com today in which the poster wanted to remove the last period of a string using regex (not strsplit). I made an attempt to do this but was unsuccessful.
N <- c("59.22.07", "58.01.32", "57.26.49")
#my attempts:
gsub("(!?\\.)", "", N)
gsub("([\\.]?!)", "", N)
How could we remove the last period in the string to get:
[1] "59.2207" "58.0132" "57.2649"
Maybe this reads a little better:
gsub("(.*)\\.(.*)", "\\1\\2", N)
[1] "59.2207" "58.0132" "57.2649"
Because it is greedy, the first (.*) will match everything up to the last . and store it in \\1. The second (.*) will match everything after the last . and store it in \\2.
It is a general answer in the sense that you can replace the \\. with any character of your choice to remove the last occurrence of that character. Only one replacement is needed!
You can even do:
gsub("(.*)\\.", "\\1", N)
You need this regex: -
[.](?=[^.]*$)
And replace it with empty string.
So, it should be like: -
gsub("[.](?=[^.]*$)","",N,perl = TRUE)
Explanation: -
[.] // Match a dot
(?= // Followed by
[^.] // Any character that is not a dot.
* // with 0 or more repetition
$ // Till the end. So, there should not be any dot after the dot we match.
)
So, if there is another dot anywhere after the current dot, the look-ahead fails and the current dot is not matched. Only the last dot in the string satisfies the look-ahead.
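A short sketch comparing the greedy-capture and look-ahead approaches on the same data. The extra element "nodot" is an assumption added here to show that a string without a period is left unchanged; sub is used instead of gsub since at most one replacement per string is needed.

```r
N <- c("59.22.07", "58.01.32", "57.26.49", "nodot")

# Greedy capture: (.*) backtracks to the last dot, which is then dropped
greedy <- sub("(.*)\\.", "\\1", N)

# Look-ahead: match a dot only if no other dot follows it
lookahead <- sub("[.](?=[^.]*$)", "", N, perl = TRUE)
```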
I'm sure you know this by now since you use stringi in your packages, but you can simply do
N <- c("59.22.07", "58.01.32", "57.26.49")
stringi::stri_replace_last_fixed(N, ".", "")
# [1] "59.2207" "58.0132" "57.2649"
I'm pretty lazy with my regex, but this works:
sub("^(.*)[.]([0-9]+)$", "\\1\\2", N)
I tend to take the opposite approach from the standard. Instead of replacing the '.' with a zero-length string, I just parse the two pieces that are on either side.

regular expression to strip leading characters up to first encountered digit

I have a string titled thisLine and I'd like to remove all characters before the first integer. I can use the command
regexpr("[0123456789]",thisLine)[1]
to determine the position of the first integer. How do I use that index to split the string?
The short answer:
sub('^\\D*', '', thisLine)
where
^ matches the beginning of the string
\\D matches any non-digit (it is the opposite of \\d)
\\D* tries to match as many consecutive non-digits as possible
My personal preference, skipping regexpr altogether:
sub("^.*?(\\d)","\\1",thisLine)
#breaking down the regex
#^ beginning of line
#. any character
#* repeated any number of times (including 0)
#? lazy modifier (match the fewest characters possible with *)
#() groups the digit
#\\d digit
#\\1 backreference to first captured group (the digit)
You want the substring function.
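A minimal sketch of that approach, using the index returned by regexpr as in the question. The input value is a hypothetical example, and the fallback for a string without any digit is an assumption about the desired behavior.

```r
thisLine <- "abc123def"  # hypothetical input

# Position of the first digit, or -1 if the string contains none
pos <- regexpr("[0-9]", thisLine)[1]

# Keep everything from the first digit onward; leave the string as-is if no digit
result <- if (pos > 0) substring(thisLine, pos) else thisLine
```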
Or use gsub to do work in one shot:
> gsub('^[^[:digit:]]*[[:digit:]]', '', 'abc1def')
[1] "def"
You may want to include that first digit, which can be done with a capture:
> gsub('^[^[:digit:]]*([[:digit:]])', '\\1', 'abc1def')
[1] "1def"
Or, as flodel and Alan indicate, simply replace "all leading non-digits" with an empty string. See flodel's answer.