R regex for everything between the last backslash and the last dot

full.path = 'C:\Users\me\Desktop\Data\my_file.csv'
I can't figure out the right regex to be left with only
essential.name = 'my_file'
I keep failing to escape the last backslash correctly.

A platform-independent regex solution can also look like
> full.path = 'C:\\Users\\me\\Desktop\\Data\\my_file.csv'
> sub(".*\\\\([^.]*).*", "\\1", full.path)
[1] "my_file"
Details:
.* - any 0+ characters as many as possible up to the last...
\\\\ - a literal \ symbol
([^.]*) - Group 1 capturing 0+ characters other than a dot
.* - and the rest of the characters up to its end.
The \\1 just inserts the contents of the Group 1 into the result.
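One caveat worth illustrating: [^.]* stops at the first dot after the last backslash, which only matters when the file name itself contains dots. A minimal sketch, where the second path and the "last dot" variant of the pattern are my own additions and not part of the answer above:

```r
# The second path is a made-up example with dots inside the file name.
paths <- c("C:\\Users\\me\\Desktop\\Data\\my_file.csv",
           "C:\\Users\\me\\Desktop\\Data\\my.file.v2.csv")

# Pattern from the answer: [^.]* stops at the FIRST dot after the last backslash.
first_dot <- sub(".*\\\\([^.]*).*", "\\1", paths, perl = TRUE)

# Variant (assumption): greedy (.*) followed by \.[^.]*$ stops at the LAST dot.
last_dot <- sub(".*\\\\(.*)\\.[^.]*$", "\\1", paths, perl = TRUE)

print(first_dot)  # "my_file" "my"
print(last_dot)   # "my_file" "my.file.v2"
```

Both patterns agree whenever the file name has a single dot, as in the question.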

We can use the basename and file_path_sans_ext (from tools) to extract the file name
tools::file_path_sans_ext(basename(full.path))
#[1] "my_file"
Or if we need regex, use gsub
gsub(".*\\\\|\\..*$", "", full.path)
#[1] "my_file"
data
full.path = 'C:\\Users\\me\\Desktop\\Data\\my_file.csv'
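As far as I know, basename only treats the backslash as a path separator on Windows, so the non-regex version may return the whole string elsewhere. A hedged sketch that converts the separators first (the unix_style name is just illustrative):

```r
full.path <- "C:\\Users\\me\\Desktop\\Data\\my_file.csv"

# Replace every literal backslash with a forward slash.
# fixed = TRUE means "\\" is a plain one-character string, not a regex.
unix_style <- gsub("\\", "/", full.path, fixed = TRUE)

# basename() splits on "/" on every platform; file_path_sans_ext() drops ".csv".
essential.name <- tools::file_path_sans_ext(basename(unix_style))
print(essential.name)  # "my_file"
```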

Related

Finding the `/` character as a separator using regexp as for phrases wrapped and not wrapped by other `/` characters

I'm trying to create a regexp that can find occurrences of / from a string however the following rules must be satisfied:
/ is a separator between each string, e.g: /string1/string2/string3/
The / is also a separator between regular expressions like /regexp1//regexp2//regexp3/
The goal is to find all occurrences of the separator / that satisfy such a condition
As a result, I would like to get the separators between the following phrases
string1
string2
string3
/regexp1/
/regexp2/
/regexp3/
string4
/string1/string2/string3//regexp1///regexp2///regexp3//string4/
So far I have written the following regexp, but it doesn't work as I expect because it doesn't handle two regexps next to each other. Does anyone have advice on how to handle that case?
((?<=\/)\/(?=\/)|(?<!\/)\/(?!\/)|(?<=\w)\/(?=\/)|(?<=\/)\/(?=\w)|\/$)
You may use this regex with an alternation and grab capture group #1:
(?<=\/)(\/[^\/]+\/|[^\/]+)(?:\/|$)
RegEx Details:
(?<=\/): Assert that previous character is /
(: Start capture group #1
\/: Match a /
[^\/]+: Match 1+ non-/ characters
\/: Match a /
|: OR
[^\/]+: Match 1+ non-/ characters
): End capture group #1
(?:\/|$): Match a / or end position
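The question does not name a language, so here is a sketch of how capture group #1 could be pulled out in R, to match the other threads on this page. It assumes perl = TRUE (needed for the lookbehind) and reads the group positions from the capture.start and capture.length attributes that gregexpr returns in that mode. The slashes need no escaping in an R string.

```r
x <- "/string1/string2/string3//regexp1///regexp2///regexp3//string4/"
pattern <- "(?<=/)(/[^/]+/|[^/]+)(?:/|$)"

# With perl = TRUE, gregexpr() records where each capture group matched.
g <- gregexpr(pattern, x, perl = TRUE)[[1]]
starts <- attr(g, "capture.start")[, 1]   # start of group 1 in every match
lens   <- attr(g, "capture.length")[, 1]  # length of group 1 in every match

phrases <- substring(x, starts, starts + lens - 1)
print(phrases)
```

On the sample string this yields the seven phrases listed in the question, with the regexp entries keeping their wrapping slashes.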

How to match a string and white space in R

I have a data frame with columns containing values like:
"Average 18.24" "Error 23.34". My objective is to remove the text and the following space from these values in R. Can anybody help me with a regex pattern to do this?
I can match the text using [A-Z], but I am not able to include the whitespace. I tried [A-Z][[:space:]] with no luck.
Your help is appreciated.
We can use sub. Use the pattern \\D+ to match all non-numeric characters and then use '' in the replacement to remove those.
sub("\\D+", '', v2)
#[1] "18.24" "23.34"
Or match one or more word characters followed by one or more space and replace with ''.
sub("\\w+\\s+", "", v2)
#[1] "18.24" "23.34"
Or if we are using stringr
library(stringr)
word(v2, 2)
#[1] "18.24" "23.34"
data
v2 <- c("Average 18.24" ,"Error 23.34")
You can add a quantifier, a-z, and the ^ anchor to your pattern. Either of these works:
"^\\S+\\s+"
"^[a-zA-Z]+[[:space:]]+"
R demo:
> b <- c("Average 18.24", "Error 23.34")
> sub("^[A-Za-z]+[[:space:]]+", "", b)
> ## or sub("^\\S+\\s+", "", b)
[1] "18.24" "23.34"
Details:
^ - start of string
[A-Za-z]+ - one or more letters (replace with \\S+ to match 1 or more non-whitespaces)
[[:space:]]+ - 1+ whitespaces (or \\s+ will match 1 or more whitespaces)
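Since the question mentions a data frame, here is a small sketch of applying the same pattern to a column. The data frame and its column names (stat, value, label) are hypothetical:

```r
# Hypothetical data frame; only the two sample values come from the question.
df <- data.frame(stat = c("Average 18.24", "Error 23.34"),
                 stringsAsFactors = FALSE)

# Drop the leading word and the whitespace after it, then convert to numeric.
df$value <- as.numeric(sub("^\\S+\\s+", "", df$stat))

# Keep only the leading word by removing the first whitespace and everything after it.
df$label <- sub("\\s+.*$", "", df$stat)

print(df)
```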

Extracting part of string using regular expressions

I'm struggling to get a bit of regular expression code to work. I have a long list of strings that I need to partially extract. I only need the strings that start with "WER", and from those I only need the last part of the string, starting from (and including) the letter.
test <- c("abc00012Z345678","WER0004H987654","WER12400G789456","WERF12","0-0Y123")
Here is the line of code which is working but only for one letter. However in my list of strings it can have any letter.
ifelse(substr(test,1,3)=="WER",gsub("^.*H.*?","H",test),"")
What I’m hoping to achieve is the following:
H987654
G789456
F12
You can use the following pattern with gsub:
> gsub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test)
[1] "" "H987654" "G789456" "F12" ""
This pattern matches:
^ - start of a string
(?: - start of an alternation group with 2 alternatives:
WER.*([a-zA-Z]\\d*) - WER char sequence followed with 0+ any characters (.*) as many as possible up to the last letter ([a-zA-Z]) followed by 0+ digits (\\d*) (replace with \\d+ to match 1+ digits, to require at least 1 digit)
| - or
.* - any 0+ characters
)$ - closing the alternation group and match the end of string with $.
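To make the \\d* versus \\d+ remark concrete, here is a sketch on a few inputs of my own (test2 is not from the question). I pass perl = TRUE so the non-capturing group is handled by PCRE:

```r
# Made-up inputs: "WERX" ends in a letter with no digits after it.
test2 <- c("WER0004H987654", "WERF12", "WERX", "abcZ1")

# \\d* : the final letter may be followed by zero digits.
any_digits <- sub("^(?:WER.*([a-zA-Z]\\d*)|.*)$", "\\1", test2, perl = TRUE)

# \\d+ : the final letter must be followed by at least one digit.
one_plus <- sub("^(?:WER.*([a-zA-Z]\\d+)|.*)$", "\\1", test2, perl = TRUE)

print(any_digits)  # "H987654" "F12" "X" ""
print(one_plus)    # "H987654" "F12" ""  ""
```

When the first alternative fails, the second one (.*) consumes the whole string and Group 1 stays empty, which is why non-matching inputs come back as "".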
With str_match from stringr, it is even tidier:
> library(stringr)
> res <- str_match(test, "^WER.*([a-zA-Z]\\d*)$")
> res[,2]
[1] NA "H987654" "G789456" "F12" NA
If there are newlines in the input, add (?s) at the beginning of the pattern: res <- str_match(test, "(?s)^WER.*([a-zA-Z]\\d*)$").
If you don't want empty strings or NA for strings that don't start with "WER", you could try the following approach:
sub(".*([A-Z].*)$", "\\1", test[grepl("^WER", test)])
#[1] "H987654" "G789456" "F12"

How to remove digit at the start of a file name?

I want to remove digits at the start of a filename. For example:
atoms/01-headings/01-heading-level-01.html
to
atoms/headings/heading-level-01.html
I've built this regex, /(^\d+|(?=\/)\/\d+)[\-\.]/img, but it seems that the positive lookahead (?=\/) consumes the / too.
How can I avoid consuming it?
Here's my tests: https://regex101.com/r/hK0rZ2/5
Search using this regex:
/(^|\/)\d+[.-]?/img
and replace by:
"$1"
Capture group #1, (^|\/), matches either the start position or a /. It is followed by one or more digits and an optional hyphen or dot. In the replacement, we use $1 as a backreference to capture group #1, which puts the / back.
It is not the lookahead that consumes the slash, the \/ in \/\d+ does.
You can use
/(?:^|(\/))\d+[-.]/igm
And replace with $1.
The regex matches:
(?:^|(\/)) - either the start of a line (^) or a / symbol and will capture the / into Group 1 (we'll later restore it with a $1 backreference)
\d+ - one or more digits
[-.] - either - or a . literally (since it is a character class, and the hyphen is at the beginning of it, no escaping is necessary).
var re = /(?:^|(\/))\d+[-.]/img;
var str = '01-heading-level-01.html\n02-heading-level-02.html\natoms/01-headings/01-heading-level-01.html\natoms/01-headings/01-heading-level-1.html\natoms/01-headings/02-heading-level-02.html\natoms/01-headings/02-heading-level-2.html\natoms/01-headings/01-heading-level-01/01-headings-level-01-red.html';
var result = str.replace(re, '$1');
document.body.innerHTML = result.replace(/\n/g, "<br/>");
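For readers following the R threads on this page, the same capture-and-restore idea can be sketched in R. This is a port under my own assumptions, not part of the answers above: the paths are held in a character vector, so no multiline flag is needed, and a single group (^|/) is used because an empty Group 1 simply restores nothing.

```r
paths <- c("01-heading-level-01.html",
           "atoms/01-headings/01-heading-level-01.html",
           "atoms/01-headings/01-heading-level-01/01-headings-level-01-red.html")

# (^|/) captures the slash (or nothing at the start); \\1 puts it back.
cleaned <- gsub("(^|/)\\d+[-.]", "\\1", paths, perl = TRUE)
print(cleaned)
```

Digits in the middle of a path segment, such as the trailing -01 in heading-level-01, are left alone because they are not directly preceded by a / or the start of the string.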

Remove any digit only in first N characters

I'm looking for a regular expression that keeps only the digits within the first 7 characters of a string, that is, removes every non-digit there.
This string has 12 characters:
A12B345CD678
I would like to remove A and B only since they are within the first 7 chars (A12B345) and get
12345CD678
So, the CD678 should not be touched. My current solution in R:
paste(paste(str_extract_all(substr("A12B345CD678",1,7), "[0-9]+")[[1]],collapse=""),substr("A12B345CD678",8,nchar("A12B345CD678")),sep="")
It seems too complicated: I split the string at position 7, keep only the digits in the first 7 characters, and bind the result to the rest of the string.
I'm looking for a more general answer.
Any help appreciated.
You can use the known SKIP-FAIL regex trick to match all the rest of the string beginning with the 8th character, and only match non-digit characters within the first 7 with a lookbehind:
s <- "A12B345CD678"
gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl=TRUE)
## => [1] "12345CD678"
Passing perl=TRUE is required for this regex to work, since lookbehinds and the (*SKIP)(*F) verbs are PCRE features. The regex breakdown:
(?<=.{7}).*$(*SKIP)(*F) - matches any character but a newline (add (?s) at the beginning if you have newline symbols in the input), as many as possible (.*) up to the end ($, also \\z might be required to remove final newlines), but only if preceded with 7 characters (this is set by the lookbehind (?<=.{7})). The (*SKIP)(*F) verbs make the engine omit the whole matched text and advance the regex index to the position at the end of that text.
| - or...
\\D - a non-digit character.
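As a sanity check, the SKIP-FAIL pattern can be compared with a plain split at position 7. The extra inputs and the via_regex / via_split names below are my own:

```r
# First string is from the question; the other two are made-up edge cases.
s <- c("A12B345CD678", "AB1", "1234567XYZ")

# SKIP-FAIL: anything from the 8th character on is matched and discarded,
# so \\D can only ever match within the first 7 characters.
via_regex <- gsub("(?<=.{7}).*$(*SKIP)(*F)|\\D", "", s, perl = TRUE)

# Same result by hand: clean the first 7 characters, then re-attach the rest.
via_split <- paste0(gsub("\\D", "", substr(s, 1, 7)),
                    substr(s, 8, nchar(s)))

print(via_regex)  # "12345CD678" "1" "1234567XYZ"
```

Both approaches should agree, including for strings shorter than 7 characters, where the whole string is treated as the prefix.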
The regex solution is cool, but I'd use something easier to read for maintainability. E.g.
library(stringr)
str_sub(s, 1, 7) = gsub('[A-Z]', '', str_sub(s, 1, 7))
You can also use a simple negative lookbehind:
s <- "A12B345CD678"
gsub("(?<!.{7})\\D", "", s, perl=TRUE)
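Since the question asks for a general answer, the negative-lookbehind version could be wrapped in a small helper with the prefix length as a parameter. The function name and the n argument are hypothetical:

```r
# Hypothetical helper: remove non-digits from the first n characters only.
remove_nondigits_in_prefix <- function(x, n = 7L) {
  # (?<!.{n}) is true only where fewer than n characters precede the position.
  pattern <- sprintf("(?<!.{%d})\\D", as.integer(n))
  gsub(pattern, "", x, perl = TRUE)
}

print(remove_nondigits_in_prefix("A12B345CD678"))         # "12345CD678"
print(remove_nondigits_in_prefix("A12B345CD678", n = 3))  # "12B345CD678"
```

The lookbehind is evaluated against the original string, so characters removed earlier in the same gsub call do not shift the 7-character window.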