Detecting sequencing using regexes - regex

Imagine I have multiple character strings in a list like this:
[[1]]
[1] "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-"
[2] "-1-I2-1-TR-1-"
[3] "-1-I2-1-FA-1-I3-1-"
[4] "-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-"
[5] "-1-I2-1-"
[6] "-1-I2-1-FA-1-I2-1-"
[7] "-1-I3-1-FA-1-QU-1-"
[8] "-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-"
[9] "-1-I2-1-"
[10] "-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-"
[11] "-1-NR-1-QU-1-QU-1-I2-1-"
I want to use a regex to detect the particular strings where a certain substring precedes another substring, but not necessarily directly preceding the other substring.
For example, let's say that we are looking for FA preceding EX. This would need to match 1 in the list. Even though FA has -1-I2-1-I2-1-I2-1- between itself and EX, the FA still occurs before the EX, hence a match is expected.
How can a generic regex be defined that identifies strings where certain substrings appear before another substring in this manner?

You may use grep.
x <- c("1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-" ,"-1-I2-1-TR-1-")
grepl("FA.*EX", x)
#[1] TRUE FALSE
grep("FA.*EX", x)
#[1] 1

Related

Matching special character in R

Hi I have the following data.
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2",
"appple+20gfree",
"BELI HG MSWAT ALA +VAT T 100g BAR WR",
"TOOLAIT CASSE+LSST+SSSRE 40g SAC MDC")
In my second step I remove all whitespace in shopping_list.
require(stringr)
shopping_list_trim <- str_replace_all(shopping_list, fixed(" "), "")
print(shopping_list_trim)
[1] "applesx4" "bagofflour" "bagofsugar"
[4] "milkx2" "appple+20gfree" "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
If I want to extract the string that does not contain plus sign I use the following code.
str_extract(shopping_list_trim, "^[^+]+$")
[1] "applesx4" "bagofflour" "bagofsugar" "milkx2" NA NA NA
Would like to help to extract the string that contain plus sign.
I would like the output to be the following one.
NA NA NA NA "appple+20gfree"
"BELIHGMSWATALA+VATT100gBARWR" "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Does anybody have idea how to extract only string that contains plus sign?
This will do the trick
> str_extract(shopping_list_trim, "^(?=.*\\+)(.+)$")
[1] NA
[2] NA
[3] NA
[4] NA
[5] "appple+20gfree"
[6] "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Regex Breakdown
^(?=.*\\+) #Lookahead to check if there is one plus sign
(.+)$ #Capture the string if the above is true
If you can't/don't want to use look-arounds, try
^.*\+.*$
It matches anything followed by a + followed by anything :)
See it work here at regex101.
Regards

How to allow for arbitrary number of wildcards in regexes?

I have a list of character strings:
> head(g_patterns_clean_strings)
[[1]]
[1] "1FAFA"
[[2]]
[1] "FA,TRFA"
[[3]]
[1] "FAEX"
I am trying to identify specific patterns in these character strings, as such:
library(devtools)
g_patterns_clean <- source_gist("164f798524fd6904236a")[[1]]
g_patterns_clean_strings <- source_gist("af70a76691aacf05c1bb")[[1]]
FA_EX_logic_vector <- grepl(g_patterns_clean_strings, pattern = "(FAEX|EXFA)+")
FA_EX_cluster <- subset(g_patterns_clean, FA_EX_logic_vector)
Let's now say that I want to allow for an arbitrary number of other characters in between FA and EX (or EX and FA), how can I specify that in the regex above?
This is a flexible generalization of #eipi10's answer:
(FA.{0,2}EX|EX.{0,2}FA)
The . matches any character, and the {0,2} quantifier matches between 0 and 2 occurrences of .

difference between [0-9] n times and [0-9]{n} in R regexp

Both are supposed to the best of my knowledge to be the same but I actually see a difference, look at this minimal example from this question:
a<-c("/Cajon_Criolla_20141024","/Linon_20141115_20141130",
"/Cat/LIQUID",
"/c_puertas_20141206_20141107",
"/C_Puertas_3_20141017_20141018",
"/c_puertas_navidad_20141204_20141205")
sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
What am I missing? Or is this a bug?
This is a bug in the TRE library related to greedy modifiers and capture groups. See:
SO question with similar issue
Issue #11 on TRE repo
Issue #21.
Setting perl=TRUE gives the same answer (as expected) for both expressions:
> sub("(.*?)_([0-9]{8})(.*)$","\\2",a,perl=TRUE)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017" "20141204"
> sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a,perl=TRUE)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017" "20141204"
Though I was initially convinced by BrodieG answer, it seems that [0-9] n times and [0-9]{n} are indeed different, at least for the "tre" regexp motor. According to http://www.regular-expressions.info the {} operator is greedy, [0-9] is not.
Hence the right regular expression in my case should be:
sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)
Making all the difference:
sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
And even
> sub("(.*)_([0-9]{8}?)(.*)$","\\2",a)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
Interpretation: 1) tre considers ? as "evaluate next atom the first time you can match this atom". This is always true for ".?" as everything matches, and it switches to _[0-9]{8}. When reaching the first group of 6 numbers, if {} is not greedy (no ?), as (.) matches also the first 8 numbers, the search continues to see if an other occurrence of "_[0-9]{8}" can be found on the line. If meeting the second set of 8 figures, it also memorizes it as a matching pattern, then it reaches the end of the line, the last matching pattern is kept and [0-9]{8} is matched to the second set of 8 numbers.
2) When {} operator is modified by ? The search stops the first time it sees 8 numbers, check if _(.*) can be matched to the rest. It can, so it returns the first set of 8 numbers.
Note that the perl regexp motor works differently,
1) ? after {} doesn't change a thing:
sub("(.*)_([0-9]{8})","\\2",a,perl=TRUE)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
sub("(.*)_([0-9]{8}?)","\\2",a,perl=TRUE)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
2) the ? applied to .* makes it to stop at the first set of 8 figures:
sub("(.*?)_([0-9]{8}).*","\\2",a,perl=TRUE)
[1] "20141024" "20141115" "/Cat/LIQUID" "20141206" "20141017"
[6] "20141204"
sub("(.*)_([0-9]{8}).*","\\2",a,perl=TRUE)
[1] "20141024" "20141130" "/Cat/LIQUID" "20141107" "20141018"
[6] "20141205"
From these two observations, it seems that the two engines interpret differently the greediness in two different instances. I always found the greediness concept to be a bit vague ...

split on last occurrence of digit, take 2nd part

If I have a string and want to split on the last digit and keep the last part of the split hpw can I do that?
x <- c("ID", paste0("X", 1:10, state.name[1:10]))
I'd like
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
But would settle for:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
I can get the first part by:
unlist(strsplit(x, "[^0-9]*$"))
But want the second part.
Thank you in advance.
You can do this one easy step with a regular expression:
gsub("(^.*\\d+)(\\w*)", "\\2", x)
Results in:
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
[9] "Delaware" "Florida" "Georgia"
What the regex does:
"(^.*\\d+)(\\w*)": Look for two groups of characters.
The first group (^.*\\d+) looks for any digit followed by at least one number at the start of the string.
The second group \\w* looks for an alpha-numeric character.
The "\\2" as the second argument to gsub() means to replace the original string with the second group that the regex found.
library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))
gives
[1] "ID" "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" "Connecticut" "Delaware"
[10] "Florida" "Georgia"
I would look at the documentation stringr for (most possibly) an even better approach.
This seems a bit clunky, but it works:
state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
state.pt2[state.pt2!=""]
It would be nice to remove the ""'s generated by the match at the start of the string but I can't figure that out.
Here's another method using substr and gregexpr too that avoids having to subset the results:
substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
gsubfn
Try this gsubfn solution:
> library(gsubfn)
> strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
It matches the last digit followed by word characters and returns the word characters or if that fails it matches the end of line (to ensure that it matches something). If the first match succeeded then return it; otherwise, the back reference will be empty so return NA.
Note that the formula is a short hand way of writing the function function(z) if (nchar(z)) z else NA and that function could alternately replace the formula at the expense of a slightly more keystrokes.
gsub
A similar strategy could also work using just straight gsub but requires two lines and a marginally more complex regular expression. Here we use the second alternative to slurp up non-matches from the first alternative:
> s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
> ifelse(nchar(s), s, NA)
[1] NA "Alabama" "Alaska" "Arizona" "Arkansas"
[6] "California" "Colorado" "Connecticut" "Delaware" "Florida"
[11] "Georgia"
EDIT: minor improvements

Notepad++ Regex - patterns from Autonomy IDOL queries / Logs

I have a document which looks something like:
sort=SIZE:NumberDecreasing
FieldText=(((EQUAL{226742}:LocationId)) AND ())
FieldText=(((EQUAL{226742}:LocationId)) AND ((EQUAL{1}:LOD AND NOTEQUAL{1}:SCR AND EMPTY{}:RPDCITYID AND NOTEQUAL{1}:Industrial)))
FieldText=( NOT EQUAL{1}:ISSCHEME AND EQUAL{215629}:LocationId)
sort=DEALDATE:decreasing
From this I would like the word before a colon, and if there are {} brackets, before those too, a colon, and then the word after the colon. These should ideally be the only things left in the file, each on their own new line.
Output would then look like:
SIZE:NumberDecreasing
EQUAL:LocationId
EQUAL:LocationId
EQUAL:LOD
NOTEQUAL:SCR
EMPTY:RPDCITYID
NOTEQUAL:Industrial
EQUAL:ISSCHEME
EQUAL:LocationId
DEALDATE:decreasing
The closest I have come so far is:
Find:
^.?+ {[0-9]}:([a-zA-Z]+)
Replace with:
...\1:\2...
with the intent to run it several times, and later replace ... with \n
I can then remove multiple newlines.
Context: this is for a log analysis I am performing, I have already removed datestamps, and reduced elements of the query down to the sort and FieldText parameters
I do not have regular UNIX tools - I am working in a windows environment
The original log looks like:
03/11/2011 16:25:44 [9] ACTION=Query&summary=Context&print=none&printFields=DISPLAYNAME%2CRECORDTYPE%2CSTREET%2CTOWN%2CCOUNTY%2CPOSTCODE%2CLATITUDE%2CLONGITUDE&DatabaseMatch=Autocomplete&sort=RECORDTYPE%3Areversealphabetical%2BDRETITLE%3Aincreasing&maxresults=200&FieldText=%28WILD%7Bbournemou%2A%7D%3ADisplayName%20NOT%20MATCH%7BScheme%7D%3ARecordType%29 (10.55.81.151)
03/11/2011 16:25:45 [9] Returning 23 matches
03/11/2011 16:25:45 [9] Query complete
03/11/2011 16:25:46 [8] ACTION=GetQueryTagValues&documentCount=True&databaseMatch=Deal&minScore=70&weighfieldtext=false&FieldName=TotalSizeSizeInSquareMetres%2CAnnualRental%2CDealType%2CYield&start=1&FieldText=%28MATCH%7BBournemouth%7D%3ATown%29 (10.55.81.151)
03/11/2011 16:25:46 [12] ACTION=Query&databaseMatch=Deal&maxResults=50&minScore=70&sort=DEALDATE%3Adecreasing&weighfieldtext=false&totalResults=true&PrintFields=LocationId%2CLatitude%2CLongitude%2CDealId%2CFloorOrUnitNumber%2CAddressAlias%2A%2CEGAddressAliasID%2COriginalBuildingName%2CSubBuilding%2CBuildingName%2CBuildingNumber%2CDependentStreet%2CStreet%2CDependentLocality%2CLocality%2CTown%2CCounty%2CPostcode%2CSchemeName%2CBuildingId%2CFullAddress%2CDealType%2CDealDate%2CSalesPrice%2CYield%2CRent%2CTotalSizeSizeInSquareMetres%2CMappingPropertyUsetype&start=1&FieldText=%28MATCH%7BBournemouth%7D%3ATown%29 (10.55.81.151)
03/11/2011 16:25:46 [8] GetQueryTagValues complete
03/11/2011 16:25:47 [12] Returning 50 matches
03/11/2011 16:25:47 [12] Query complete
03/11/2011 16:25:51 [13] ACTION=Query&print=all&databaseMatch=locationidsearch&sort=RELEVANCE%2BPOSTCODE%3Aincreasing&maxResults=10&start=1&totalResults=true&minscore=70&weighfieldtext=false&FieldText=%28%20NOT%20LESS%7B50%7D%3AOFFICE%5FPERCENT%20AND%20EXISTS%7B%7D%3AOFFICE%5FPERCENT%20NOT%20EQUAL%7B1%7D%3AISSCHEME%29&Text=%28Brazennose%3AFullAddress%2BAND%2BHouse%3AFullAddress%29&synonym=True (10.55.81.151)
03/11/2011 16:25:51 [13] Returning 3 matches
03/11/2011 16:25:51 [13] Query complete
The purpose of the whole exercise is to find out which fields are being queried and sorted upon (and how we are querying/sorting upon them) - to this end, the output could also usefully be distinct - although that is not essential.
The Perl program below is complete, and includes your sample data in the source. It produces exactly the output you describe, including reporting NOT EQUAL{1}:ISSCHEME as EQUAL:ISSCHEME because of the intermediate space.
use strict;
use warnings;
while (<DATA>) {
print "$1:$2\n" while /(\w+) (?: \{\d*\} )? : (\w+)/xg;
}
__DATA__
sort=SIZE:NumberDecreasing
FieldText=(((EQUAL{226742}:LocationId)) AND ())
FieldText=(((EQUAL{226742}:LocationId)) AND ((EQUAL{1}:LOD AND NOTEQUAL{1}:SCR AND EMPTY{}:RPDCITYID AND NOTEQUAL{1}:Industrial)))
FieldText=( NOT EQUAL{1}:ISSCHEME AND EQUAL{215629}:LocationId)
sort=DEALDATE:decreasing
OUTPUT
SIZE:NumberDecreasing
EQUAL:LocationId
EQUAL:LocationId
EQUAL:LOD
NOTEQUAL:SCR
EMPTY:RPDCITYID
NOTEQUAL:Industrial
EQUAL:ISSCHEME
EQUAL:LocationId
DEALDATE:decreasing