Capturing parts of string using regular expression in R - regex

I have these strings:
myseq <- c("ALM_GSK_LN_06.ID","AS04_LV_06.ID.png","AS04_SP_06.IP.png")
What I want to do is to capture parts of the sequence
ALM_GSK LN ID
AS04 LV ID
AS04 SP IP
I tried this but failed:
library(stringr)
str_match(myseq, "([A-Z]+)_(LN|LV|SP)_06\\.([A-Z]+)")
Which produces:
[,1] [,2] [,3] [,4]
[1,] "GSK_LN_06.ID" "GSK" "LN" "ID"
[2,] NA NA NA NA
[3,] NA NA NA NA
What's the right way to do it?

You are pretty close. Here is a small adjustment:
str_match(myseq, "(.+)_(LN|LV|SP)_06\\.([A-Z]+)")[, -1]
produces:
[,1] [,2] [,3]
[1,] "ALM_GSK" "LN" "ID"
[2,] "AS04" "LV" "ID"
[3,] "AS04" "SP" "IP"
Yours doesn't work because your first token, [A-Z]+, matches neither digits nor underscores, both of which you need here: "AS04" contains digits and "ALM_GSK" contains an underscore.
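To make the difference concrete, here is a minimal base R sketch (no stringr needed) that widens the first group to a character class covering uppercase letters, digits, and underscores. The class [A-Z0-9_]+ is an assumption about what the prefixes can contain; the (.+) used above is more permissive.

```r
myseq <- c("ALM_GSK_LN_06.ID", "AS04_LV_06.ID.png", "AS04_SP_06.IP.png")

# First group now also accepts digits and underscores (assumed prefix alphabet).
pat <- "([A-Z0-9_]+)_(LN|LV|SP)_06\\.([A-Z]+)"

# regexec() records the full match plus one entry per capture group;
# regmatches() turns those positions into strings.
m <- regmatches(myseq, regexec(pat, myseq))

# Stack into a matrix and drop the first column (the full match).
parts <- do.call(rbind, m)[, -1]
parts
```

Each row of parts then holds the three captured pieces for one input string.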

Your regular expression fails on the prefix because [A-Z]+ matches only uppercase letters. To fix this, change the first group to something more permissive, such as the greedy (.+). Here is another solution:
library(gsubfn)
myseq <- c('ALM_GSK_LN_06.ID', 'AS04_LV_06.ID.png', 'AS04_SP_06.IP.png')
strapply(myseq, '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)', c, simplify = rbind)
# [,1] [,2] [,3]
# [1,] "ALM_GSK" "LN" "ID"
# [2,] "AS04" "LV" "ID"
# [3,] "AS04" "SP" "IP"

Totally stealing @hwnd's regex, but in a tidyr/dplyr approach:
library(dplyr); library(tidyr)
tibble(myseq) %>%
extract(myseq, c('A', 'B', 'C'), '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)')
## A B C
## 1 ALM_GSK LN ID
## 2 AS04 LV ID
## 3 AS04 SP IP

Related

Cleaning up dates (years, specifically) with regex

I have a database with a non-validated year field. Most of the entries are 4-digit years, but about 10% of the entries are "whatever." This has led me down the rabbit hole of regular expressions to little avail. Getting better results than what I have is progress, even if I don't extract 100%.
#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing a 2-digit year, nought else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))
The desired output is to grab the first readable year, so 1992-1993 would be 1992 and "the 70s" would be 1970.
How can I improve my parsing accuracy? Thanks!
EDIT: Pursuant to garyh's answer this gets me much closer:
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79" "07-2608" "07-262008" "96" "70" "93" "70" "15" "60" "70" NA "2013" "1992"
but note that, while the dates with dashes in them work in garyh's regex101.com demo, they don't work in R with sub: the month and day values and the first dash are kept.
I also realize I didn't include an example date with slashes rather than dashes. Another term in the regex should handle that, but again R does not produce the same (correct) result that regex101.com does.
sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"
These negative lookbehinds and lookaheads are very powerful but stretch my feeble brain.
Not sure what flavour of regex R uses, but this seems to get all the years in the string:
/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g
This matches any 4 digits, or any 2 digits provided they are not followed by a dash or a digit and not preceded by another digit.
See the demo on regex101.com.
You're going to need some elbow grease and do something like:
library(lubridate)
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96 ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
x <- yearEntries
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
x <- gsub("[']*[s]*", "", x)
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")
yearEntries <-ifelse(!is.na(y), y, x)
yearEntries
## [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA "13" "92"
We have no idea which year you need from ranged entries, but this should get you started.
I found a very simple way to get a good result (though I would not claim it is bulletproof). It grabs the last readable year, which is okay too.
yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96 ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume last two digits present in any string represent a 2-digit year
a<-sub(".*(\\d{2}).*$","\\1",yearEntries)
# [1] "99" "08" "08" "96" "70" "95" "15" NA "13" "93" "ongoing"
# change to numeric, strip NAs and add 2000
b<-na.omit(as.numeric(a))+2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b>2015]<-b[b>2015]-100
# [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993
...and Bob's your uncle!
@garyh's regex actually works well if you use the regmatches/gregexpr combo to extract the pattern instead of sub:
regmatches(yearEntries,
gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"
[[2]]
[1] "08"
[[3]]
[1] "2008"
[[4]]
[1] "96"
[[5]]
[1] "70"
[[6]]
[1] "95"
[[7]]
[1] "70"
[[8]]
[1] "15"
[[9]]
[1] "60"
[[10]]
[1] "70"
[[11]]
character(0)
[[12]]
[1] "2013"
[[13]]
[1] "1992" "1993"
To only keep the first matching pattern:
sapply(regmatches(yearEntries,gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE)),`[`,1)
[1] "79" "08" "2008" "96" "70" "95" "70" "15" "60" "70" NA "2013" "1992"
regmatches("07/09/13",gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}","07/09/13",perl=TRUE))
[[1]]
[1] "13"
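The question asks for four-digit years (1992, 1970), so the first-match extraction above still needs a final step. Here is a rough sketch of one way to expand the two-digit values; the pivot of 15 (two-digit values above it are read as 19xx, the rest as 20xx) is an assumption chosen to fit this sample, not something the data guarantees.

```r
yearEntries <- c("79, 80, 99", "07-26-08", "07-26-2008", "'96 ", "Early 70's",
                 "93/95", "late 70's", "15", "late 60s", "Late 70's", NA,
                 "2013", "1992-1993")

pat <- "(?<!\\d)\\d{2}(?!-|/|\\d)|\\d{4}"

# First match per entry; entries with no match (including NA) become NA.
first <- sapply(regmatches(yearEntries,
                           gregexpr(pat, yearEntries, perl = TRUE)),
                `[`, 1)
yr <- as.numeric(first)

# Assumed pivot: two-digit values above it are 19xx, the rest 20xx.
pivot <- 15
two_digit <- !is.na(yr) & yr < 100
yr[two_digit] <- yr[two_digit] + ifelse(yr[two_digit] > pivot, 1900, 2000)
yr
```

Note that "93/95" still resolves to the second year, because the lookahead rejects two digits followed by a slash.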

colsplit to split into individual characters

colsplit in package reshape2 can be used to split character data:
colsplit(c("A_1", "A_2", "A_3"), pattern="_", c("Letter", "Number"))
Letter Number
1 A 1
2 A 2
3 A 3
In his paper "Reshaping Data with the reshape Package", Hadley Wickham gives an example of using colsplit to split data into individual characters. His example should produce the above from the data c("A1", "A2", "A3"), which he does by omitting the pattern argument. But this throws an error.
The documentation for str_split_fixed, which colsplit calls, says that setting pattern="" will split into individual characters, but this does not work.
Is there any way to use colsplit so that it splits into individual characters?
This is R 3.1.1 and packages are up to date.
The problem is that you are referring to an article about "reshape" but are using "reshape2". The two are not the same and they don't work the same:
library(reshape)
library(reshape2)
reshape:::colsplit(c("A1", "A2", "A3"), "", c("V1", "V2"))
# V1 V2
# 1 A 1
# 2 A 2
# 3 A 3
reshape2:::colsplit(c("A1", "A2", "A3"), "", c("V1", "V2"))
# V1 V2
# 1 NA A1
# 2 NA A2
# 3 NA A3
If you don't have to go the colsplit way, there are other options:
do.call(rbind, strsplit(c("A1", "A2", "A3"), "", fixed = TRUE))
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "2"
# [3,] "A" "3"
Or, a more general approach (for example characters followed by numbers, not necessarily one character each):
do.call(rbind, strsplit(c("A1", "A2", "A3"),
split = "(?<=[a-zA-Z])(?=[0-9])",
perl = TRUE))
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "2"
# [3,] "A" "3"
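If the lookaround split feels opaque, a capture-group version may be easier to read. This is only a sketch in base R: the input vector with longer entries and the column names Letter and Number are made up for illustration, and it assumes every element is letters followed by digits (an element that does not match would yield character(0) and silently misalign the rows).

```r
x <- c("A1", "AB12", "C345")  # hypothetical input, letters then digits

# One group for the letters, one for the digits, anchored to the whole string.
m <- regmatches(x, regexec("^([A-Za-z]+)([0-9]+)$", x))

# Drop the full-match column and keep the two groups.
parts <- do.call(rbind, m)[, -1, drop = FALSE]

out <- data.frame(Letter = parts[, 1],
                  Number = as.integer(parts[, 2]),
                  stringsAsFactors = FALSE)
out
```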
Using qdap:
library(qdap)
colSplit(c("A1", "A2", "A3"), "")
## X1 X2
## 1 A 1
## 2 A 2
## 3 A 3

how to use regexp to return small part of character based on pattern

This should be easy for anyone who understands regular expressions, which I'm struggling to do.
I have a vector of strings that looks like
strings<-c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
"PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
"08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")
I want to convert this vector into a dataframe where each row is from an element of the original vector with 3 columns where each column comes from the part in parenthesis. So the result here would be
col1 col2 col3
"LN" "WEC" "WPS"
"LN" "MP" "MP"
"XF" "CIN" "*"
If there is more than one instance of the pattern in a string, then it should take the first instance.
I think my main problem is that ( is a special character. I'm trying to escape it with \( but I get an error that \( is an unrecognized escape character, so I'm just a little lost.
Sounds like you're forgetting to escape the \ in \(, i.e. \\(:
do.call(rbind, strsplit(sub('.*?\\((.*?)\\).*', '\\1', strings), split = "/"))
[,1] [,2] [,3]
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP" "MP"
[3,] "XF" "CIN" "*"
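To spell out the escaping point: in R source code the string "\\(" is two characters, a backslash and a paren, and that is what the regex engine needs to see. The sketch below is one possible base R variant that takes only the first parenthesized group per string and returns the data frame the question asks for; the intermediate names first_paren and inner are just illustrative.

```r
strings <- c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
             "PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
             "08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")

# "\\(" in R source is a backslash followed by a paren: two characters.
nchar("\\(")

# regexpr() locates only the first "(...)" in each string.
first_paren <- regmatches(strings, regexpr("\\([^)]*\\)", strings))

# Strip the surrounding parens; inside [] they need no escaping.
inner <- gsub("[()]", "", first_paren)

# Split on the literal slash and bind the pieces into rows.
res <- as.data.frame(do.call(rbind, strsplit(inner, "/", fixed = TRUE)),
                     stringsAsFactors = FALSE)
names(res) <- c("col1", "col2", "col3")
res
```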
1) We define a pattern that matches the following pieces in order, each shown with its regex fragment:
left-paren: \\(
non-slashes: ([^/]+)
slash: /
non-slashes: ([^/]+)
slash: /
non-right-parens: ([^)]+)
remainder: .*
Now extract the parenthesized portions using strapplyc and simplify into a matrix. The code is:
library(gsubfn)
pat <- "\\(([^/]+)/([^/]+)/([^)]+).*"
strapplyc(strings, pat, simplify = rbind)
giving:
[,1] [,2] [,3]
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP" "MP"
[3,] "XF" "CIN" "*"
2) This alternative uses strapplyc nested in strapply. The regular expressions are slightly simpler and it's still basically one line of code, but that line is longer. The first regex picks out everything between the first set of parens and the second extracts the slash-separated fields:
strapply(strings, "\\(([^)]+).*", ~ strapplyc(x, "[^/]+")[[1]], simplify = rbind)
REVISED Some improvements to first solution plus a variation as second solution.

Split column label by number of letters/characters in R

I have a large dataset where all column headers are individual IDs, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique way to separate the column name (i.e., there's a comma or a period separating the 2 components).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character. {7} means 7 times. Hence, the regex matches a position that has 7 characters immediately before it. For these 8-character IDs, strsplit therefore splits between the 7th and the 8th character.
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A" "B" "A" "B" "A" "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicated by surrounding with parentheses) the single character (.) that follows zero or more other characters (.*), and replace matches with the captured content (\\1)". For more info, see ?gsub.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)
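Putting the two halves together, a short sketch of how the two-row layout from the question could be assembled. It uses substring(), which is vectorized over its position arguments, so no sapply is needed; the row names ID1 and ID2 follow the question's intended dataset.

```r
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")

# Length of each ID, so this also works if the IDs are not all 8 characters.
n <- nchar(ID1)

# Row 1: everything but the last character; row 2: the last character.
out <- rbind(ID1 = substring(ID1, 1, n - 1),
             ID2 = substring(ID1, n, n))
out
```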

Stop after first match found (str_match)

Is there an option to stop the search after the first "match" is found using str_match? Something equivalent to grep's "m"? I looked in the stringr package, but couldn't find anything. Perhaps I missed it?
In a given string:
str <- "This is a 12-month study cycle"
I'm using the below to extract: 12-month from it
str_match(str, "(?i)(\\w+)[- ](month|months|week|weeks)")[1]
But if the string str extends to:
"This is a 12-month study cycle. In the 2 month period,blah blah...".
I'd like the search to just stop and retrieve 12-month and not get both: 12-month and 2-month. Any idea how I can do this?
How about this?
str <- "This is a 12-month study cycle"
regmatches(str, regexpr("(?i)(\\w+)[- ](month|months|week|weeks)", str) )
[1] "12-month"
str2 <- "This is a 12-month study cycle. In the 2 month period,blah blah..."
regmatches(str2, regexpr("(?i)(\\w+)[- ](month|months|week|weeks)", str2) )
[1] "12-month"
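The reason this stops at the first hit is that regexpr() reports only the first match per string, while gregexpr() reports all of them. A small side-by-side sketch, with the pattern slightly rewritten (months?|weeks? and ignore.case = TRUE in place of the inline (?i)); the variable names are illustrative.

```r
txt <- "This is a 12-month study cycle. In the 2 month period, blah blah..."
pat <- "(\\w+)[- ](months?|weeks?)"

# regexpr(): first match only.
first_hit <- regmatches(txt, regexpr(pat, txt, ignore.case = TRUE, perl = TRUE))

# gregexpr(): every match; the result is a list with one element per input string.
all_hits <- regmatches(txt, gregexpr(pat, txt, ignore.case = TRUE, perl = TRUE))[[1]]

first_hit
all_hits
```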
Try the stringi package. If you want to match all, use stri_match_all_regex; if you want just the first or last, use stri_match_first_regex or stri_match_last_regex. (In the calls below, str holds the longer two-sentence string.)
library(stringi)
stri_match_first_regex(str, "(?i)(\\w+)[- ](month|months|week|weeks)")
[,1] [,2] [,3]
[1,] "12-month" "12" "month"
stri_match_all_regex(str, "(?i)(\\w+)[- ](month|months|week|weeks)")
[[1]]
[,1] [,2] [,3]
[1,] "12-month" "12" "month"
[2,] "2 month" "2" "month"