Substring like matches inside regular expressions? - regex

I have a string on which I need to do a regex match (I'm working in R). It looks like:
"354542676655341568:1373344735:270969722:text1,text2,text4,text8"
This string has 4 parts separated by colens (:). I have multiple strings with different values, but composed of the same 4 parts.
The first numerical part I plan to match using "[0-9]{18}"
For the second part (it is a timestamp), I have a piece of code that generates a regex for a range that I'll append. A sample looks like this:
":0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):"
This above pattern matches for all numbers between 1373300000 & 1373344800.
The Third part also is a plain [0-9]{9}
The problem is the fourth part, where I'll have to match the text part. I'll have a list of text content like text1, text3, text5. I need to accept the string if it has atleast one of the texts from the list. It's more like a substring match for the fourth part.
I've thought of splitting the text, but in my application, it would be a poor design with high resource costs. Hence, I'd like to generate one regex that does the entire match together.
I tried a few things to test this out, but I'm getting false positives. Any help available?
checktext = "check:text1,text2,text3"
> grepl("check:[a-zA-Z0-9 ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text2]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:.*[text1].*",checktext)
[1] TRUE
> grepl("check:.*[text2].*",checktext)
[1] TRUE
> grepl("check:.*[text3].*",checktext)
[1] TRUE
> grepl("check:.*[text2|text4].*",checktext)
[1] TRUE
> grepl("check:.*[text5|text4].*",checktext)
After #sgibb 's reply, I put all the parts together to make the final pattern as:
"[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:[a-zA-Z0-9, ]+,(Samsung|Nokia)"
and my text string was:
"354542676655341568:1373344735:270969722:Samsung,Galaxy"
It didn't match. Is it due to putting all of them together? When I removed the last (text) part from the regex, it matched.
> finalpattern
[1] "[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:"
> keysample
[1] "354542676655341568:1373344735:270969722:Samsung,Galaxy"
> grepl(finalpattern,keysample)
[1] TRUE

IMHO you use the [ wrong. A [ contains a class of characters to match (means at least one of the character in [ should match). If you want to group a pattern/string (e.g. text5|text4) you have to use (:
grepl("check:[a-zA-Z0-9, ]+,(text3|text4)",checktext)
# [1] TRUE
grepl("check:[a-zA-Z0-9, ]+,(text5|text4)",checktext)
# [1] FALSE
This should remove most of your false-positives.
Address your edit:
Your regular expression is wrong (the part after the :).
[a-zA-Z0-9, ]+,: you look for alphanumeric characters (BTW see ?regex: classes [:alnum:]) occurring at least ones and followed by a ,. This will match agains Samsung.
Next you look for (Samsung|Nokia) but there is only Galaxy left.
There are multiple solutions:
"[[:alnum:], ]*(Samsung|Nokia)[[:alnum:], ]*"
"(Samsung|Nokia),[[:alnum:], ]+"
".*(Samsung|Nokia).*"
# ...
Or you should think about splitting your string at : and analyze each part separately.

Related

How to find pattern next to a given string using regex in R

I have a string formatted for example like "segmentation_level1_id_10" and would like to extract the level number associated to it (i.e. the number directly after the word level).
I have a solution that does this in two steps, first finds the pattern level\\d+ then replaces the level with missing after, but I would like to know if it's possible to do this in one step just with str_extract
Example below:
library(stringr)
segmentation_id <- "segmentation_level1_id_10"
segmentation_level <- str_replace(str_extract(segmentation_id, "level\\d+"), "level", "")
One way to do it is by using a stringr library str_extract function with a regex featuring a lookbehind:
> library(stringr)
> s = "segmentation_level1_id_10"
> str_extract(s, "(?<=level)\\d+")
## or to make sure we match the level after _: str_extract(s, "(?<=_level)\\d+")
[1] "1"
Or using str_match that allows extracting captured group texts:
> str_match(s, "_level(\\d+)")[,2]
[1] "1"
It can be done with base R using the gsub and making use of the same capturing mechanism used in str_match, but also using a backreference to restore the captured text in the replacement result:
> gsub("^.*level(\\d+).*", "\\1", s)
[1] "1"

How to substitute a letter with its lowercase after finding it through regex match in R

I have a bunch of strings with a hyphen in it. I want to remove the hyphen and convert the following letter to lower case while keeping all the other letters intact. How do you accomplish task in R?
test <- "Kwak Min-Jung"
gsub(x=test,pattern="-(\\w)",replacement="\\1")
# [1] "Kwak MinJung" , Not what I want
# I want it to convert to "Kwak Minjung"
Try this:
> gsub("-(\\w)", "\\L\\1", test, perl = TRUE)
[1] "Kwak Minjung"
or this:
> library(gsubfn)
> gsubfn("-(\\w)", tolower, test)
[1] "Kwak Minjung"
Use \\L or \\U to change the case in the replacement argument. You can use \\E to end the effect of the case conversion.
gsub(x=test,pattern="-(\\w)",replacement="\\L\\1", perl=TRUE)
# [1] "Kwak Minjung"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)

R-regex: match strings not beginning with a pattern

I'd like to use regex to see if a string does not begin with a certain pattern. While I can use: [^ to blacklist certain characters, I can't figure out how to blacklist a pattern.
> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE
I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. to match if the string does not start with abc sequence.
Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:
> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE
Any ideas?
Yeah. Put the zero width lookahead /outside/ the other parens. That should give you this:
> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE
which I think is what you want.
Alternately if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are both equivalent and acceptable.
There is now (years later) another possibility with the stringr package.
library(stringr)
str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE
str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE
Created on 2020-01-13 by the reprex package (v0.3.0)
I got stuck on the following special case, so I thought I would share...
What if there are multiple instances of the regular expression, but you still only want the first segment?
Apparently you can turn off the implicit greediness of the search
with specific perl wildcard modifiers
Suppose the string I wanted to process was
myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
LETTERS[1:13], "_", LETTERS[14:26], "__",
"laksjdl", "_", "lakdjlfalsjdf"),
collapse = "")
myExampleString
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjd"
and that I wanted only the first segment before the first "__".
I cannot simply search on "_", because single-underscore is
an allowable non-delimiter in this example string.
The following doesn't work. It instead gives me the first and second segments because of the default greediness (but not third, because of the forward-look).
gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"
But this does work
gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz"
The difference is the greedy-modifier "?" after the wildcard ".+"
in the (perl) regular expression.

Complete word matching using grepl in R

Consider the following example:
> testLines <- c("I don't want to match this","This is what I want to match")
> grepl('is',testLines)
> [1] TRUE TRUE
What I want, though, is to only match 'is' when it stands alone as a single word. From reading a bit of perl documentation, it seemed that the way to do this is with \b, an anchor that can be used to identify what comes before and after the patter, i.e. \bword\b matches 'word' but not 'sword'. So I tried the following example, with use of Perl syntax set to 'TRUE':
> grepl('\bis\b',testLines,perl=TRUE)
> [1] FALSE FALSE
The output I'm looking for is FALSE TRUE.
"\<" is another escape sequence for the beginning of a word, and "\>" is the end.
In R strings you need to double the backslashes, so:
> grepl("\\<is\\>", c("this", "who is it?", "is it?", "it is!", "iso"))
[1] FALSE TRUE TRUE TRUE FALSE
Note that this matches "is!" but not "iso".
you need double-escaping to pass escape to regex:
> grepl("\\bis\\b",testLines)
[1] FALSE TRUE
Very simplistically, match on a leading space:
testLines <- c("I don't want to match this","This is what I want to match")
grepl(' is',testLines)
[1] FALSE TRUE
There's a whole lot more than this to regular expressions, but essentially the pattern needs to be more specific. What you will need in more general cases is a huge topic. See ?regex
Other possibilities that will work for this example:
grepl(' is ',testLines)
[1] FALSE TRUE
grepl('\\sis',testLines)
[1] FALSE TRUE
grepl('\\sis\\s',testLines)
[1] FALSE TRUE