Extracting capturing groups from a regex - regex

This regex: (.*?)(?:I[0-9]-)*I3(?:-I[0-9])* matches an expression using multiple groups. The point of the regex is that it captures patterns in pairs of two, where the first part of the regex has to be followed by the second part of the regex.
How can I extract each of these two groups?
library(stringr)
data <- c("A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7")
str_extract_all(data, "(.*?)(?:I[0-9]-)*I3(?:-I[0-9])*")
Gives me:
[[1]]
[1] "A-B-C-I1-I2-D-E-F-I1-I3" "-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7"
However, I would want something along the lines of:
[[1]]
[1] "A-B-C-I1-I2-D-E-F" [2] "I1-I3"
[[2]]
[1] "D-D-D-D" [2] "I1-I1-I2-I1-I1-I3-I3-I7"
The key here is that regex matches twice, each time containing 2 groups. I want every match to have a list of it's own, and that list to contain 2 elements, one for each group.

You need to wrap a capturing group around the second part of your expression and if you're using stringr for this task, I would use str_match_all instead to return the captured matches ...
library(stringr)
data <- c('A-B-C-I1-I2-D-E-F-I1-I3-D-D-D-D-I1-I1-I2-I1-I1-I3-I3-I7')
mat <- str_match_all(data, '-?(.*?)-((?:I[0-9]-)*I3(?:-I[0-9])*)')[[1]][,2:3]
colnames(mat) <- c('Group 1', 'Group 2')
# Group 1 Group 2
# [1,] "A-B-C-I1-I2-D-E-F" "I1-I3"
# [2,] "D-D-D-D" "I1-I1-I2-I1-I1-I3-I3-I7"

Related

Regular expression: matching multiple words

I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:
MEDIUM /REGULAR INSEAM
XX LARGE /SHORT INSEAM
SMALL /32" INSM
X LARGE /30" INSM
I have to capture two things: the value before the / as a whole(SMALL,XX LARGE) and the string(alphabetic or numeric) after it. I dont want the " INSM or the INSEAM part.
The regular expression for first two I am using is ([A-Z]\w+) \/([A-Z]\w+) INSEAM and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM].
The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL, but fails for X LARGE, XX LARGE etc. Is there a way I can modify it to capture two occurances of word before the / character? Or is there a better way to do it?
Thanks in advance!
From your description, Wiktor's regex will fail on "XX LARGE/SHORT" due to the extra space. It is safer to capture everything before the forward slash as a group:
sub("^(.*/\\w+).*", "\\1", x)
#[1] "MEDIUM /REGULAR" "XX LARGE /SHORT" "SMALL /32" "X LARGE /30"
It seems you can use
(\w+(?: \w+)?) */ *(\w+)
See the regex demo
Pattern details:
(\w+(?: \w+)?) - Group 1 capturing one or more word chars followed with an optional sequence of a space + one or more word chars
*/ * - a / enclosed with 0+ spaces
(\w+) - Group 2 capturing 1 or more word chars
R code with stringr:
> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
[,1] [,2] [,3]
[1,] "MEDIUM /REGULAR" "MEDIUM" "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"
[3,] "SMALL /32" "SMALL" "32"
[4,] "X LARGE /30" "X LARGE" "30"

Match everything but numbers regular expression

I want to have a regular expression that match anything that is not a correct mathematical number. the list below is a sample list as input for regex:
1
1.7654
-2.5
2-
2.
m
2..3
2....233..6
2.2.8
2--5
6-4-9
So the first three (in Bold) should not get selected and the rest should.
This is a close topic to another post but because of it's negative nature, it is different.
I'm using R but any regular expression will do I guess.
The following is the best shot in the mentioned post:
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
grep(pattern="(-?0[.]\\d+)|(-?[1-9]+\\d*([.]\\d+)?)|0$", x=a)
which outputs:
\[1\] 1 2 3 4 5 7 8 9 10 11
You can use following regex :
^(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*|[^\d]+$
See demo https://regex101.com/r/tZ3uH0/6
Note that your regex engine should support look-ahead with variable length.and you need to use multi-line flag and as mentioned in comment you can use perl=T to active look-ahead in R.
this regex is contains 2 part that have been concatenated with an OR.first part is :
(?:((\d+(?=[^.]+|\.{2,})).)+|(\d\.){2,}).*
which will match a combination of digits that followed by anything except dot or by 2 or more dot.which the whole of this is within a capture group that can be repeat and instead of this group you can have a digit which followed by dot 2 or more time (for matching some strings like 2.3.4.) .
and at the second part we have [^\d]+ which will match anything except digit.
Debuggex Demo
a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)]
With a suggested edit from #Frank.
Speed Test
a <- rep(a, 1e4)
all.equal(a[is.na(as.numeric(a))], a[grep("^-?\\d+(\\.?\\d+)?$|^\\d+\\.$", a, invert=T)])
[1] TRUE
library(microbenchmark)
microbenchmark(dosc = a[is.na(as.numeric(a))],
plafort = a[grep("^-?\\d*(\\.?\\d*)$", a, invert=T)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dosc 27.83477 28.32346 28.69970 28.51254 28.76202 31.24695 100
# plafort 31.92118 32.14915 32.62036 32.33349 32.71107 35.12258 100
I think this should do the job:
re <- "^-?[0-9]+$|^-?[0-9]+\\.[0-9]+$"
R> a[!grepl(re, a)]
#[1] "2-" "2." "m" "2..3" "2....233..6" "2.2.8" "2--5"
#[8] "6-4-9"
The solution here is good. You only have to add the negative case [-] and invert the selection!
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a[grep(pattern="(^[1-9]\\d*(\\.\\d+)?$)|(^[-][1-9]\\d*(\\.\\d+)?$)",invert=TRUE, x=a)]
[1] "2-" "2." "m" "2..3" "2....233..6"
[6] "2.2.8" "2--5" "6-4-9"
Try this:
a[!grepl("^\\-?\\d?\\.?\\d+$", a)]
I like the simplicity of as.numeric(). This would be my suggestion:
require(stringr)
a <- c("1", "1.7654", "-2.5", "2-", "2.", "m", "2..3", "2....233..6", "2.2.8", "2--5", "6-4-9")
a
a1 <- ifelse(str_sub(a, -1) == ".", "string filler", a)
a1
outvect <- is.na(as.numeric(a1))
outvect

How to allow for arbitrary number of wildcards in regexes?

I have a list of character strings:
> head(g_patterns_clean_strings)
[[1]]
[1] "1FAFA"
[[2]]
[1] "FA,TRFA"
[[3]]
[1] "FAEX"
I am trying to identify specific patterns in these character strings, as such:
library(devtools)
g_patterns_clean <- source_gist("164f798524fd6904236a")[[1]]
g_patterns_clean_strings <- source_gist("af70a76691aacf05c1bb")[[1]]
FA_EX_logic_vector <- grepl(g_patterns_clean_strings, pattern = "(FAEX|EXFA)+")
FA_EX_cluster <- subset(g_patterns_clean, FA_EX_logic_vector)
Let's now say that I want to allow for an arbitrary number of other characters in between FA and EX (or EX and FA), how can I specify that in the regex above?
This is a flexible generalization of #eipi10's answer:
(FA.{0,2}EX|EX.{0,2}FA)
The . matches any character, and the {0,2} quantifier matches between 0 and 2 occurrences of .

Split column label by number of letters/characters in R

I have a large dataset where all column headers are individual IDS, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique way to separate the column name (ie: there's a comma separating the 2 components, or a period).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character. {7} means 7 times. Hence, the regex matches the position that is preceded by exactly 7 characters.
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A" "B" "A" "B" "A" "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicating by surrounding with parentheses) the single character (.) that follows zero or more other characters (.*), and replace matches with the captured content (\\1)". For more info, see ?gsub, and here.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)

Regex matching everything that's not a 4 digit number

I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but, every try to invert this and extract the number instead fails.
I want:
[1] 1234
does someone has a clue?
ps: I know how to do it with {stringr} but am wondering if it's possible with {base} only..
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"
regmatches(), only available since R-2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec"
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
It's possible to capture group in regex using (). Taking the same example
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning