Remove last occurrence of character - regex

A question came across talkstats.com today in which the poster wanted to remove the last period of a string using regex (not strsplit). I made an attempt to do this but was unsuccessful.
N <- c("59.22.07", "58.01.32", "57.26.49")
#my attempts:
gsub("(!?\\.)", "", N)
gsub("([\\.]?!)", "", N)
How could we remove the last period in the string to get:
[1] "59.2207" "58.0132" "57.2649"

Maybe this reads a little better:
gsub("(.*)\\.(.*)", "\\1\\2", N)
[1] "59.2207" "58.0132" "57.2649"
Because it is greedy, the first (.*) will match everything up to the last . and store it in \\1. The second (.*) will match everything after the last . and store it in \\2.
It is a general answer in the sense that you can replace the \\. with any character of your choice to remove the last occurrence of that character. It takes only one replacement!
You can even do:
gsub("(.*)\\.", "\\1", N)
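As a minimal sketch of the generalization described above, the greedy-capture idea can be wrapped in a small helper. The name remove_last is hypothetical (it is not from the original answer). It assumes perl = TRUE so that \\Q...\\E can quote the character literally, and it assumes ch does not itself contain \\E.

```r
# Hypothetical helper: drop the last occurrence of `ch` from each string.
remove_last <- function(x, ch) {
  # \\Q...\\E (PCRE) treats `ch` literally, so "." or "+" need no escaping.
  pattern <- paste0("(.*)\\Q", ch, "\\E")
  # The greedy (.*) runs up to the last occurrence; sub() replaces only once.
  sub(pattern, "\\1", x, perl = TRUE)
}

N <- c("59.22.07", "58.01.32", "57.26.49")
remove_last(N, ".")
# [1] "59.2207" "58.0132" "57.2649"
remove_last("a-b-c", "-")
# [1] "a-bc"
```

Strings that do not contain the character come back unchanged, because sub() leaves non-matching input alone.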

You need this regex: -
[.](?=[^.]*$)
And replace it with empty string.
So, it should be like: -
gsub("[.](?=[^.]*$)","",N,perl = TRUE)
Explanation: -
[.] // Match a dot
(?= // Followed by
[^.] // Any character that is not a dot.
* // with 0 or more repetition
$ // Till the end. So, there should not be any dot after the dot we match.
)
So, if another dot appears anywhere after the current dot, the look-ahead fails and the current dot is not matched. Only the last dot, which has no dot after it, satisfies the pattern.
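One way to see the look-ahead at work is to ask where the pattern matches and how often. The following is only an illustrative check on the sample vector; the variable names pos and n_matches are made up for this sketch.

```r
N <- c("59.22.07", "58.01.32", "57.26.49")
pattern <- "[.](?=[^.]*$)"

# Position of the first match in each string: the dot at position 3 is
# rejected by the look-ahead, so the match is the dot at position 6.
pos <- as.integer(regexpr(pattern, N, perl = TRUE))
pos
# [1] 6 6 6

# Number of matches per string: only the last dot qualifies.
n_matches <- lengths(regmatches(N, gregexpr(pattern, N, perl = TRUE)))
n_matches
# [1] 1 1 1
```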

I'm sure you know this by now since you use stringi in your packages, but you can simply do
N <- c("59.22.07", "58.01.32", "57.26.49")
stringi::stri_replace_last_fixed(N, ".", "")
# [1] "59.2207" "58.0132" "57.2649"

I'm pretty lazy with my regex, but this works:
gsub("(.*)(\\.)([0-9]+$)","\\1\\3",N)
I tend to take the opposite approach from the standard. Instead of replacing the '.' with a zero-length string, I just parse the two pieces that are on either side.

Related

R- regex extracting a string between a dash and a period

First of all I apologize if this question is too naive or has been repeated earlier. I tried to find it in the forum but I'm posting it as a question because I failed to find an answer.
I have a data frame with row names as follows:
head(rownames(u))
[1] "A17-R-Null-C-3.AT2G41240" "A18-R-Null-C-3.AT2G41240" "B19-R-Null-C-3.AT2G41240"
[4] "B20-R-Null-C-3.AT2G41240" "A21-R-Transgenic-C-3.AT2G41240" "A22-R-Transgenic-C-3.AT2G41240"
What I want is to use regex in R to extract the string in between the first dash and the last period.
Anticipated results are,
[1] "R-Null-C-3" "R-Null-C-3" "R-Null-C-3"
[4] "R-Null-C-3" "R-Transgenic-C-3" "R-Transgenic-C-3"
I tried following with no luck...
gsub("^[^-]*-|.+\\.","\\2", rownames(u))
gsub("^.+-","", rownames(u))
sub("^[^-]*.|\\..","", rownames(u))
Would someone be able to help me with this problem?
Thanks a lot in advance.
Shani.
Here is a solution to be used with gsub:
v <- c("A17-R-Null-C-3.AT2G41240", "A18-R-Null-C-3.AT2G41240", "B19-R-Null-C-3.AT2G41240", "B20-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240", "A22-R-Transgenic-C-3.AT2G41240")
gsub("^[^-]*-([^.]+).*", "\\1", v)
See IDEONE demo
The regex matches:
^[^-]* - zero or more characters other than -
- - a hyphen
([^.]+) - Group 1 matching and capturing one or more characters other than a dot
.* - any characters (even including a newline since perl=T is not used), any number of occurrences up to the end of the string.
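Note that ([^.]+) ends Group 1 at the first period, which is enough for the sample data because each name contains only one. If a name could contain several periods and the text up to the last one is wanted, one possible variant is a greedy group followed by an escaped period. The sketch below uses a shortened v and a made-up name w to show the difference.

```r
v <- c("A17-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240")

# Group 1 ends at the FIRST period.
first_period <- sub("^[^-]*-([^.]+).*", "\\1", v)
first_period
# gives "R-Null-C-3" and "R-Transgenic-C-3"

# Variant: a greedy (.*) followed by \\. ends Group 1 at the LAST period.
w <- "A17-R-Null-C-3.5.AT2G41240"   # made-up name with two periods
last_period <- sub("^[^-]*-(.*)\\..*$", "\\1", w)
last_period
# [1] "R-Null-C-3.5"
```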
This can easily be achieved with the following regex:
-([^.]+)
# look for a dash
# then match everything that is not a dot
# and save it to the first group
See a demo on regex101.com. Outputs are:
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Null-C-3
R-Transgenic-C-3
R-Transgenic-C-3
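The answer above gives only the pattern. As a rough sketch of how the first capture group could be pulled out in base R, regexec() and regmatches() can be combined; the shortened vector v and the name group1 are just for illustration.

```r
v <- c("A17-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240")

# regexec() records the full match plus each capture group;
# regmatches() turns that into a list of character vectors.
m <- regmatches(v, regexec("-([^.]+)", v))

# Element 1 is the full match, element 2 is the first capture group.
group1 <- vapply(m, function(g) g[2], character(1))
group1
# gives "R-Null-C-3" and "R-Transgenic-C-3"
```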
Regex
-([^.]+)\\.
Description
- matches the character - literally
1st capturing group ([^.]+)
  [^.]+ matches a single character not present in the list below
    Quantifier +: between one and unlimited times, as many times as possible, giving back as needed [greedy]
    . matches the character . literally
\\. matches the character . literally
Debuggex Demo
Output
MATCH 1
1. [4-14] `R-Null-C-3`
MATCH 2
1. [29-39] `R-Null-C-3`
MATCH 3
1. [54-64] `R-Null-C-3`
MATCH 4
1. [85-95] `R-Null-C-3`
MATCH 5
1. [110-126] `R-Transgenic-C-3`
MATCH 6
1. [141-157] `R-Transgenic-C-3`
This seems an appropriate case for lookarounds:
library(stringr)
str_extract(v, '(?<=-).*(?=\\.)')
where
(?<= ... ) is a positive lookbehind, i.e. it requires a - immediately before the start of the match;
.* is any character . repeated 0 or more times *;
(?= ... ) is a positive lookahead, i.e. it requires a period (escaped as \\.) immediately after the end of the match.
I used stringr::str_extract above because it's more direct in terms of what you're trying to do. It is possible to do the same thing with sub (or gsub), but the regex has to be uglier:
sub('.*?(?<=-)(.*)(?=\\.).*', '\\1', v, perl = TRUE)
.*? looks for any character . from 0 to as few as possible times *? (lazy evaluation);
the lookbehind (?<=-) is the same as above;
now the part we want .* is put in a captured group (...), which we'll need later;
the lookahead (?=\\.) is the same;
.* matches any character, repeated 0 to as many times as possible (here, up to the end of the string).
The replacement is \\1, which refers to the first captured group from the pattern regex.
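If adding stringr is not desirable, the same lookaround pattern can probably be used with base R's regexpr() and regmatches(), as long as perl = TRUE is set. This is a sketch on a shortened v, not part of the original answer.

```r
v <- c("A17-R-Null-C-3.AT2G41240", "A21-R-Transgenic-C-3.AT2G41240")

# Locate the text between the first dash and the last period.
m <- regexpr("(?<=-).*(?=\\.)", v, perl = TRUE)

# regmatches() returns only the matched substrings.
extracted <- regmatches(v, m)
extracted
# gives "R-Null-C-3" and "R-Transgenic-C-3"
```

One difference to keep in mind: regmatches() silently drops elements with no match, whereas str_extract() returns NA for them, so the result can be shorter than the input.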

Remove any non-digit only in first N characters

I'm looking for a regular expression to catch all digits in the first 7 characters in a string.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated: I split the string after the 7th character, keep only the digits from the first 7 characters, and paste the result back onto the rest of the string.
I am looking for a more general answer.
Any help appreciated.
You can use the well-known SKIP-FAIL regex trick: match and skip the rest of the string from the 8th character onward (located with a lookbehind), so that non-digit characters are matched only within the first 7:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=T)
## => [1] "12345CD678"
See IDEONE demo
The perl=T is required for this regex to work. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
See the regex demo.
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=T)
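To make the cut-off a parameter rather than a hard-coded 7, the negative-lookbehind pattern can be built with sprintf(). The two helper names below are hypothetical; the second one is a plain substr() version included only as a cross-check of the regex.

```r
# Hypothetical helper: remove non-digits from the first `n` characters only.
strip_head_nondigits <- function(s, n) {
  # A non-digit is removed unless at least `n` characters precede it.
  pattern <- sprintf("(?<!.{%d})\\D", as.integer(n))
  gsub(pattern, "", s, perl = TRUE)
}

# Non-regex cross-check: clean the head, then paste the tail back on.
strip_head_split <- function(s, n) {
  paste0(gsub("\\D", "", substr(s, 1, n)), substr(s, n + 1, nchar(s)))
}

s <- "A12B345CD678"
strip_head_nondigits(s, 7)
# [1] "12345CD678"
strip_head_split(s, 7)
# [1] "12345CD678"
```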

R - regular expression - capturing a number in file name

I have several files. Example names are as follows:
ABC2_5XYZ_7_data.csv
DEF2_10QST_7_data.csv
Every time I read the file names, I would like to capture the number immediately after the first _ and store it in another variable.
In the above example these are the "5" and "10".
Can anyone suggest something?
I think this would work. I added a couple more strings just to make sure. Since we are looking for the first and only match, we can use sub().
x <- c("ABC2_5XYZ_data.csv", "DEF2_10QST_data.csv", "A123_456ABC_data.csv", "X9F4_7912D_data.csv")
sub(".*_(\\d+).*", "\\1", x)
# [1] "5" "10" "456" "7912"
The regular expression .*_(\\d+).* captures the digits immediately following the underscore. The \\1 returns us the captured digits.
.* matches any character (except newline), zero or more times
_ matches the character _ literally
( starts the capturing group
\\d+ match a digit one or more times
) ends the capturing group
.* matches any character (except newline), zero or more times
Further explanation can be found at regex101
Update after OP changed the question: In response to your comments, and the changed question, you can use the following. Note that we are still using sub() (not gsub()!) since we want the first match.
x <- c("ABC2_5XYZ_7_data.csv", "DEF2_10QST_7_data.csv")
sub("[[:alnum:]]+_(\\d+).*", "\\1", x)
# [1] "5" "10"
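Since the goal is to store the numbers in another variable, it may help to convert them to integers and to guard against names that do not fit the pattern, because sub() returns its input unchanged when nothing matches. In the sketch below, the leading ^ anchor and the extra file name "notes.txt" are assumptions added for illustration.

```r
files <- c("ABC2_5XYZ_7_data.csv", "DEF2_10QST_7_data.csv", "notes.txt")
pattern <- "^[[:alnum:]]+_(\\d+).*"

# Extract the digits only where the pattern matches; otherwise use NA.
digits <- ifelse(grepl(pattern, files), sub(pattern, "\\1", files), NA_character_)

# Store the result as integers.
ids <- as.integer(digits)
ids
# [1]  5 10 NA
```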

What is the purpose of .*\\?

I have been playing around with list.files() and I wanted to only list 001.csv through 010.csv and I came up with this command:
list_files <- list.files(directory, pattern = ".*\\000|010", full.names = TRUE)
This code gives me what I want, but I do not fully understand what is happening with the pattern argument. How does pattern = ".*\\000" work?
\\0 is a backreference that inserts the whole regex match to that point. Compare the following to see what that can mean:
sub("he", "", "hehello")
## [1] "hello"
sub("he\\0", "", "hehello")
## [1] "llo"
With strings like "001.csv" or "009.csv", what happens is that the .* matches zero characters, the \\0 repeats those zero characters one time, and the 00 matches the first two zeros in the string. Success!
This pattern won't match "100.csv" or "010.csv" because it can't find anything to match that is doubled and then immediately followed by two 0s. It will, though, match "1100.csv", because it matches 1, then doubles it, and then finds two 0s.
So, to recap, ".*\\000" matches any string beginning with xx00 where x stands for any substring of zero or more characters. That is, it matches anything repeated twice and then followed by two zeros.
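If the aim is simply to list 001.csv through 010.csv, a pattern without backreferences is easier to reason about. The following is a sketch under that assumption; it is tested with grepl() on made-up names rather than a real directory, and the same pattern string could then be passed to the pattern argument of list.files().

```r
# Made-up file names for testing the pattern.
files <- c(sprintf("%03d.csv", 0:12), "100.csv", "1100.csv")

# 001 to 009, or 010, followed by a literal ".csv" and nothing else.
pattern <- "^(00[1-9]|010)\\.csv$"

wanted <- files[grepl(pattern, files)]
wanted
# gives "001.csv" through "010.csv"
```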

regular expression to strip leading characters up to first encountered digit

I have a string titled thisLine and I'd like to remove all characters before the first integer. I can use the command
regexpr("[0123456789]",thisLine)[1]
to determine the position of the first integer. How do I use that index to split the string?
The short answer:
sub('^\\D*', '', thisLine)
where
^ matches the beginning of the string
\\D matches any non-digit (it is the opposite of \\d)
\\D* tries to match as many consecutive non-digits as possible
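A quick check on a few made-up inputs shows the behaviour, including the edge cases where the string already starts with a digit or contains no digit at all:

```r
thisLine <- c("abc123def456", "42 is first", "no digits here")

# Strip the leading run of non-digits from each element.
stripped <- sub("^\\D*", "", thisLine)
stripped
# gives "123def456", "42 is first" and "" (everything is removed
# when the string contains no digit)
```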
My personal preference, using a lazy quantifier and a capture group:
sub("^.*?(\\d)","\\1",thisLine)
#breaking down the regex
#^ beginning of line
#. any character
#* repeated any number of times (including 0)
#? lazy modifier (makes * match as few characters as possible)
#() groups the digit
#\\d digit
#\\1 backreference to first captured group (the digit)
You want the substring function.
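A minimal sketch of that approach, reusing the position from regexpr() in the question. The sample value of thisLine is made up, and the check for a missing digit is an added assumption about how that case should be handled.

```r
thisLine <- "abc123def"

# Position of the first digit; regexpr() returns -1 when there is none.
pos <- regexpr("[0-9]", thisLine)[1]

# Keep everything from that position on, or the whole string if no digit.
rest <- if (pos > 0) substring(thisLine, pos) else thisLine
rest
# [1] "123def"
```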
Or use gsub to do work in one shot:
> gsub('^[^[:digit:]]*[[:digit:]]', '', 'abc1def')
[1] "def"
You may want to include that first digit, which can be done with a capture:
> gsub('^[^[:digit:]]*([[:digit:]])', '\\1', 'abc1def')
[1] "1def"
Or as flodel and Alan indicate, simply replace "all leading non-digits" with a blank. See flodel's answer.