regex select multiple groups - regex

I have the following string from which I want to extract the content between the second pair of colons (ES5206956802492 in the example):
"20160607181026_0000005:0607181026000000501:ES5206956802492:479"
I am using R and specifically the stringr package to manipulate strings.
The command I attempted to use is:
str_extract("20160607181026_0000005:0607181026000000501:ES5206956802492:479", ":(.*):")
where the regex pattern is expressed at the end of the command. This produces the following result:
":0607181026000000501:ES5206956802492:"
I know that there is a way of grouping results and back-reference them, which would allow me to select only the part I am interested in, but I don't seem to be able to figure out the right syntax.
How can I achieve this?

Also word from stringr,
library(stringr)
word(v1, 3, sep=':')
#[1] "ES5206956802492"

If the field we want always starts with an uppercase letter, we can use a compact regex. Here, we use a regex lookbehind ((?<=:)) to require a preceding :, then match an uppercase letter ([A-Z]) followed by one or more characters that are not a : ([^:]+).
str_extract(v1, "(?<=:)[A-Z][^:]+")
#[1] "ES5206956802492"
Or, if it is based on position (i.e. the field after the 2nd :), a base R option is to match zero or more non-: characters ([^:]*) followed by the first :, then zero or more non-: characters followed by the second :, then capture the following run of non-: characters in a group ((...)), and finally match the rest of the string (.*). In the replacement, we use the backreference \\1 (the first capture group).
sub("[^:]*:[^:]*:([^:]+).*", "\\1", v1)
#[1] "ES5206956802492"
Or the repeating part can be grouped and quantified to make it more compact (the field we want is then the second capture group):
sub("([^:]*:){2}([^:]+).*", "\\2", v1)
#[1] "ES5206956802492"
Or with strsplit, we split at delimiter : and extract the 3rd element.
strsplit(v1, ":")[[1]][3]
#[1] "ES5206956802492"
data
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"
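Since the question asks specifically about capture groups, here is a minimal base R sketch that returns the group directly instead of rewriting the string with sub. It assumes the wanted field is always the third colon-separated one; the names m and third are only illustrative.

```r
v1 <- "20160607181026_0000005:0607181026000000501:ES5206956802492:479"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(v1, regexec("^[^:]*:[^:]*:([^:]*)", v1))

# Element 1 is the full match, element 2 is the first capture group.
third <- m[[1]][2]
third
#[1] "ES5206956802492"
```

If the pattern does not match, m[[1]] is empty and third becomes NA, which is easy to test for.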

Related

capture repetition of letters in a word with regex

I'm trying to detect conditions where words have repetition of letters, and i would like to replace such matched conditions with the repeated letter. The text is in Hebrew. For instance, שללללוווווםםםם should just become שלום.
Basically,when a letter repeats itself 3 times or more - it should be detected and replaced.
I want to use the regex expression for r gsub.
df$text <- gsub("?", "?", df$text)
You can use
> x = "שללללוווווםםםם"
> gsub("(.)\\1{2,}", "\\1", x)
#[1] "שלום"
NOTE: It will collapse any character (not just Hebrew) that is repeated three or more times.
Or use the following to restrict it to word characters (letters, digits and underscore) from any language:
> gsub("(\\w)\\1{2,}", "\\1", x)
If you plan to only remove repeating characters from the Hebrew script (keeping others), I'd suggest:
s <- "שללללוווווםםםם ......... שללללוווווםםםם"
gsub("(\\p{Hebrew})\\1{2,}", "\\1", s, perl=TRUE)
Details:
(\\p{Hebrew}) - Group 1 capturing a character from Hebrew script (as \p{Hebrew} is a Unicode property/category class)
\\1{2,} - 2 or more (due to {2,} limiting quantifier) same characters stored in Group 1 buffer (as \\1 is a backreference to Group 1 contents).
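To make the effect of the {2,} quantifier concrete, here is a small sketch on a made-up ASCII string (the input is an assumption chosen for illustration, not data from the question). It shows that runs of two are left alone while runs of three or more are collapsed.

```r
x <- "aaa bb cccc"

# (.) captures one character; \\1{2,} requires at least two more copies of it,
# so only runs of three or more identical characters are replaced by one copy.
collapsed <- gsub("(.)\\1{2,}", "\\1", x, perl = TRUE)
collapsed
#[1] "a bb c"
```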

Remove letters matching pattern before and after the required string

I have a vector with the following elements:
myvec<- c("output.chr10.recalibrated", "output.chr11.recalibrated",
"output.chrY.recalibrated")
I want to selectively extract the value after chr and before .recalibrated and get the result.
Result:
10, 11, Y
You can do that with a mere sub:
> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y"
The pattern matches any symbols before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the characters. In the replacement pattern, we use a backreference \1 that inserts the captured value you need back into the resulting string.
As an alternative, use str_match:
> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y"
It keeps all captured values and helps avoid costly unanchored lookarounds in the pattern that are necessary in str_extract.
The pattern means:
chr - match a sequence of literal characters chr
(.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
\\.recalibrated - .recalibrated literal character sequence.
Both answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here's my own approach, which differs only in the regex passed to sub:
sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)
What the regex does:
.*[.]chr matches as much as possible up to a literal '.chr'
([^.]*) captures everything after chr that is not a dot (could be replaced by \\d+ to capture only numeric values, which requires at least one digit to be present)
[.].* matches a literal dot and then the rest of the line
I prefer escaping dots with a character class ([.]) over a backslash escape (\\.), as it's usually easier to read when you come back to the regex. That's my opinion and not covered by any best practice I know of.
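The difference between the two sub patterns can be checked on the longer input mentioned above. This is only a sketch; the variable names lazy and strict are illustrative, and perl = TRUE is added to the first call so the lazy quantifiers behave as in PCRE.

```r
x <- "whatever.chr10.whateverelse.recalibrated"

# Lazy pattern from the earlier answer: the group runs up to ".recalibrated",
# so it also picks up the extra ".whateverelse" part.
lazy <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", x, perl = TRUE)
lazy
#[1] "10.whateverelse"

# Negated character class: the group stops at the first dot after "chr".
strict <- sub(".*[.]chr([^.]*)[.].*", "\\1", x)
strict
#[1] "10"
```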
We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and come before .recalibrated ((?=\\.recalibrated)).
library(stringr)
str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
#[1] "10" "11" "Y"
Or use gsub to match either everything up to and including .chr, or (|) everything from .recalibrated to the end ($) of the string, and replace both with ''.
gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
#[1] "10" "11" "Y"
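The same lookaround pattern can also be used without stringr. The sketch below uses base R's regexpr and regmatches; chr_ids is just an illustrative name.

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# perl = TRUE is needed here: lookbehind and lookahead are PCRE features
# that R's default regex engine does not accept.
chr_ids <- regmatches(myvec,
                      regexpr("(?<=chr).*(?=\\.recalibrated)", myvec, perl = TRUE))
chr_ids
#[1] "10" "11" "Y"
```

Note that regmatches silently drops elements with no match, so the result can be shorter than the input.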
Looks like XY problem. Why extract? If this is needed in further analysis steps, we could for example do this instead:
for (chrN in c(1:22, "X", "Y")) {
  myVar <- paste0("output.chr", chrN, ".recalibrated")
  # do some fun stuff with myVar
  print(myVar)
}

grep float with percent sign in R

I have long strings containing various text and numeric data like this
a <- "$3,295,000; 8 Units; 4.08% Cap Rate; 9,360 SF Bldg;"
and I would like to be able to extract the percentage, in this case 4.08%.
How can I match this pattern with grep()?
You can use a non-greedy match in sub for this:
sub('.*?([0-9.]+%).*', '\\1', a)
[1] "4.08%"
This will only match the first instance of the pattern in the string.
The .*? is non-greedy, so it won't "suck up" characters in the following pattern, which matches what you want.
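To see why the non-greedy prefix matters, compare it with a greedy one on the same string. This is a small sketch; perl = TRUE is used in both calls so the two behave consistently, and the names lazy_res and greedy_res are illustrative.

```r
a <- "$3,295,000; 8 Units; 4.08% Cap Rate; 9,360 SF Bldg;"

# Lazy prefix: .*? consumes as little as possible, so the group gets the
# whole number in front of the percent sign.
lazy_res <- sub(".*?([0-9.]+%).*", "\\1", a, perl = TRUE)
lazy_res
#[1] "4.08%"

# Greedy prefix: .* consumes as much as it can and leaves the group only the
# minimum it needs, i.e. a single digit plus "%".
greedy_res <- sub(".*([0-9.]+%).*", "\\1", a, perl = TRUE)
greedy_res
#[1] "8%"
```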
We can use str_extract
library(stringr)
str_extract_all(a, "[0-9.]+%")[[1]]
#[1] "4.08%"
If we only need the first match, use str_extract instead of str_extract_all. The sub approach in the other answer also returns only the first match, so str_extract_all is the better choice when a string may contain several percentages.
grep returns the indices of the elements that match a pattern, not the matched text. If we use grep here, it returns 1 (there is only one element in the vector and it matches the pattern):
grep("[0-9.]+%", a)
#[1] 1
For extracting the substring, either str_extract or sub/gsub (from base R) can be used.
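The distinction between finding and extracting can be sketched in base R as follows. The vector x is made up for illustration; it includes an element with no percentage and one with two.

```r
x <- c("no rate here", "4.08% Cap Rate", "1.5% down, 2% up")

# grep() answers "which elements match?"
idx <- grep("[0-9.]+%", x)
idx
#[1] 2 3

# gregexpr() + regmatches() answer "what text matched?",
# returning one list entry per input element.
hits <- regmatches(x, gregexpr("[0-9.]+%", x))
hits[[3]]
#[1] "1.5%" "2%"
```

Elements without a match, such as the first one here, yield character(0) in the list.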
A stricter pattern is (?:[0-9]{1,2}(?:\.[0-9]{1,2})?%) (in an R string, write the backslash as \\).
It matches one or two digits, optionally followed by a dot and one or two more digits, and then the % sign.
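A sketch of how this stricter pattern could be used from R, assuming perl = TRUE and base R's regexpr and regmatches (pct_pattern and pct are illustrative names):

```r
a <- "$3,295,000; 8 Units; 4.08% Cap Rate; 9,360 SF Bldg;"

# Backslashes are doubled inside an R string literal.
pct_pattern <- "[0-9]{1,2}(?:\\.[0-9]{1,2})?%"

pct <- regmatches(a, regexpr(pct_pattern, a, perl = TRUE))
pct
#[1] "4.08%"

# Unlike [0-9.]+%, the stricter pattern rejects input made only of dots.
loose_ok <- grepl("[0-9.]+%", "..%")
strict_ok <- grepl(pct_pattern, "..%", perl = TRUE)
c(loose_ok, strict_ok)
#[1]  TRUE FALSE
```

One trade-off: a three-digit value such as 100% would only be matched as 00%, so the {1,2} limits should be widened if such values can occur.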

regex - excluding a specific part of an URL via regex match in gsub

I'm working with a vector below:
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
"http://statistics.gov.scot/id/statistical-geography/S02000003")
I would like to remove http://statistics.gov.scot/id/statistical-geography/ from the vector. My present regex syntax:
vec_cln <- gsub(replacement = "", x = vec, perl = TRUE, fixed = FALSE,
pattern = "([[:alnum:]]|[[:punct:]]|)(?<!S\\d{8})")
But this leaves only last digit from vector vec. I'm guessing that the problem is with \\d{8}, however, it's not clear to me how to work around it. I tried various solutions on regex101 but to no avail. Some examples:
(?<!S\d) - this leaves second digit
(?<!S[[:digit:]]) - same
What I'm trying to achieve can be simply summarised: match everything until you find a capital letter S followed by 8 digits.
Notes
I want to arrive at the solution via gsub and regex. I don't want to use:
gsubfn and proto objects
substr, as I may have to work with strings of variable lengths
You can obtain the result using
sub(".*(S\\d{8})", "\\1", vec)
With .*, we match zero or more (*) characters other than a newline, as many as possible, up to the S followed by 8 digits (S\\d{8}). Since (S\\d{8}) is inside unescaped parentheses, the substring matched by this subpattern is stored in capture group 1. With the \\1 backreference, we restore the captured text in the result.
See more about backreferences and capturing groups at regular-expressions.info.
NOTE: if you have more text after S+8 digits, you can use
sub("^.*(S\\d{8}).*$", "\\1", vec)
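An alternative way to think about it is to extract the code directly rather than delete what surrounds it. The sketch below uses base R's regexpr and regmatches; the second input string is made up to illustrate how the greedy .* behaves when more than one code is present.

```r
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
         "http://statistics.gov.scot/id/statistical-geography/S02000003")

# Extract the first "S + 8 digits" from each element.
codes <- regmatches(vec, regexpr("S[0-9]{8}", vec))
codes
#[1] "S02000002" "S02000003"

# With sub(), the greedy .* means the LAST "S + 8 digits" is kept
# when a string contains several.
last_code <- sub("^.*(S[0-9]{8}).*$", "\\1", "x/S00000001/y/S02000002")
last_code
#[1] "S02000002"
```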
Here it is with slightly prettier syntax:
library(rex)
library(stringi)
library(magrittr)
regex_1 = rex("S", digits)
vec <- c("http://statistics.gov.scot/id/statistical-geography/S02000002",
"http://statistics.gov.scot/id/statistical-geography/S02000003")
vec %>% stri_extract_last_regex(regex_1)

R - regular expression - capturing a number in file name

I have several files. Example names are as follows:
ABC2_5XYZ_7_data.csv
DEF2_10QST_7_data.csv
Every time I read the filenames, I would like to capture the number right after the first _ and store it in another variable.
In the above example these are the "5" and "10".
Can anyone suggest something ?
I think this would work. I added a couple more strings just to make sure. Since we are looking for the first and only match, we can use sub().
x <- c("ABC2_5XYZ_data.csv", "DEF2_10QST_data.csv", "A123_456ABC_data.csv", "X9F4_7912D_data.csv")
sub(".*_(\\d+).*", "\\1", x)
# [1] "5" "10" "456" "7912"
The regular expression .*_(\\d+).* captures the digits immediately following the underscore. The \\1 returns us the captured digits.
.* matches any character (except newline), zero or more times
_ matches the character _ literally
( starts the capturing group
\\d+ matches a digit one or more times
) ends the capturing group
.* matches the rest of the string (any character except newline, zero or more times)
Further explanation can be found at regex101
Update after OP changed the question: In response to your comments, and the changed question, you can use the following. Note that we are still using sub() (not gsub()!) since we want the first match.
x <- c("ABC2_5XYZ_7_data.csv", "DEF2_10QST_7_data.csv")
sub("[[:alnum:]]+_(\\d+).*", "\\1", x)
# [1] "5" "10"
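Since the goal is to store the numbers in another variable, here is a short sketch that also converts them to integers. The names files and first_num are illustrative, and the pattern is anchored with ^ as an extra precaution.

```r
files <- c("ABC2_5XYZ_7_data.csv", "DEF2_10QST_7_data.csv")

# Capture the digits after the first underscore, then convert to integer
# so the result can be used numerically.
first_num <- as.integer(sub("^[[:alnum:]]+_([0-9]+).*", "\\1", files))
first_num
#[1]  5 10
```

If a filename does not match the pattern, sub returns it unchanged and as.integer yields NA with a warning, which makes unexpected names easy to spot.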