Complete word matching using grepl in R - regex

Consider the following example:
> testLines <- c("I don't want to match this","This is what I want to match")
> grepl('is',testLines)
[1] TRUE TRUE
What I want, though, is to match 'is' only when it stands alone as a single word. From reading a bit of Perl documentation, it seemed that the way to do this is with \b, an anchor that identifies what comes before and after the pattern, i.e. \bword\b matches 'word' but not 'sword'. So I tried the following example, with perl set to TRUE:
> grepl('\bis\b',testLines,perl=TRUE)
[1] FALSE FALSE
The output I'm looking for is FALSE TRUE.

"\<" is another escape sequence for the beginning of a word, and "\>" is the end.
In R strings you need to double the backslashes, so:
> grepl("\\<is\\>", c("this", "who is it?", "is it?", "it is!", "iso"))
[1] FALSE TRUE TRUE TRUE FALSE
Note that this matches "is!" but not "iso".

You need to double the backslash so that the escape reaches the regex engine; in an R string literal, a single \b is read as a backspace character:
> grepl("\\bis\\b",testLines)
[1] FALSE TRUE
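To make the escaping issue concrete, here is a small sketch that compares the two string literals and then runs the doubled-backslash pattern with both regex engines. The variable names (n_single, n_double, res_default, res_perl) are only illustrative.

```r
testLines <- c("I don't want to match this", "This is what I want to match")

# "\b" inside an ordinary R string is a single backspace character,
# so the regex engine never receives a word-boundary escape.
n_single <- nchar("\bis\b")    # backspace, i, s, backspace
n_double <- nchar("\\bis\\b")  # \, b, i, s, \, b

# Doubling the backslash passes a literal \b through to the regex engine.
res_default <- grepl("\\bis\\b", testLines)
res_perl    <- grepl("\\bis\\b", testLines, perl = TRUE)

print(c(n_single, n_double))
print(res_default)
print(res_perl)
```

On R 4.0.0 or later, a raw string such as r"(\bis\b)" should avoid the doubling altogether.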

Very simplistically, match on a leading space:
testLines <- c("I don't want to match this","This is what I want to match")
grepl(' is',testLines)
[1] FALSE TRUE
There's a whole lot more than this to regular expressions, but essentially the pattern needs to be more specific. What you will need in more general cases is a huge topic. See ?regex
Other possibilities that will work for this example:
grepl(' is ',testLines)
[1] FALSE TRUE
grepl('\\sis',testLines)
[1] FALSE TRUE
grepl('\\sis\\s',testLines)
[1] FALSE TRUE
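The space-based patterns work for this example but are easy to trip up. The sketch below uses a few made-up test strings (not from the question) to show where they differ from a word-boundary pattern.

```r
# Hypothetical inputs chosen to expose the edge cases
x <- c("an island", "it is", "is it?", "this")

lead_space  <- grepl(" is", x)         # also hits "island"; misses a leading "is"
both_spaces <- grepl(" is ", x)        # misses "is" at either end of the string
boundary    <- grepl("\\bis\\b", x)    # matches "is" only as a whole word

print(rbind(lead_space, both_spaces, boundary))
```

If whole-word matching is the goal, the boundary version is probably the safer default.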

Related

R grepl - matching two strings

I am facing a problem with the grep/grepl functions in R. When I run
grepl("[Aa][Bb][Cc]x", c("Abcx", "abCy"))
I got:
[1] TRUE FALSE
And it's OK. Similarly, for:
grepl("[Aa][Bb][Cc]y", c("Abcx", "abCy"))
I got:
[1] FALSE TRUE
And that's also all right. But when I write:
grepl("[Aa][Bb][Cc]x | [Aa][Bb][Cc]y", c("Abcx", "abCy"))
it gives me counter-intuitive
[1] FALSE FALSE
What's the problem?
You need to remove spaces around |:
grepl("[Aa][Bb][Cc]x|[Aa][Bb][Cc]y", c("Abcx", "abCy"))
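These spaces matter: with them, the first alternative requires a space after the x and the second requires a space before the a or A, so neither test string matches. You might use a PCRE regex though with a (?x) modifier, which makes it possible to introduce some formatting whitespace in between subpatterns for better readability: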
grepl("(?x)[Aa][Bb][Cc]x | [Aa][Bb][Cc]y", c("Abcx", "abCy"), perl=TRUE)
Or better use this shorter version:
grepl("[Aa][Bb][Cc][xy]", c("Abcx", "abCy"))
where the pattern is first shrunk to [Aa][Bb][Cc](x|y) and since these are single characters, I recommend using a character class ((x|y) -> [xy]).
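As a quick check of the point about literal spaces, the sketch below adds a third, made-up input with a trailing space, and also tries ignore.case = TRUE as a possible alternative to spelling out [Aa][Bb][Cc].

```r
# "Abcx " (with a trailing space) is a hypothetical extra input
x <- c("Abcx", "abCy", "Abcx ")

# The spaces around | are part of the pattern, so only "Abcx " matches.
with_spaces <- grepl("[Aa][Bb][Cc]x | [Aa][Bb][Cc]y", x)

# Case-insensitive matching removes the need for the bracket pairs.
ignore_case <- grepl("abc[xy]", x, ignore.case = TRUE)

print(with_spaces)
print(ignore_case)
```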

R Wildcard matching for certain number of terms

Suppose I have a string and am searching for particular wildcard terms. For example:
x <- "AJSDKLAFJASFJABJKADL"
z <- stri_locate_all_regex(x, 'A*****AF')
I want to search for all terms that have any 5 characters in between A and AF, like ABJDKAAF or AJSDKLAF... However the above code does not work. Is there a simple way to do this that I am overlooking? Thank you!
In regular expressions (as opposed to standard wildcards that you might be used to), * means "0 or more of the preceding character", so "A*" means "0 or more A". You can't stack them like '****'. For that you want '.', which means "any one character".
z <- stri_locate_all_regex(x, 'A.....AF')
TL,DR: regex problem, not R problem.
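For readers without stringi installed, roughly the same lookup can be sketched in base R with regexpr() and regmatches(). Note that this swaps in base functions for stri_locate_all_regex(), and the variable names are illustrative.

```r
x <- "AJSDKLAFJASFJABJKADL"

# A, then any five characters, then AF
m <- regexpr("A.{5}AF", x)

start   <- as.integer(m)             # position of the first match
len     <- attr(m, "match.length")   # length of that match
matched <- regmatches(x, m)          # the matched text itself

print(c(start = start, length = len))
print(matched)
```

gregexpr() would be the analogous call if more than one match per string is expected.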
For a simple way to do this, and by this I assume you mean that you want to use your wildcard characters as in the question, you can turn these into proper regular expressions using glob2rx(). A "wildcard" expression, also known as a "glob", is a sort of poor man's regular expression (?regex). For your expression, you can specify five ? characters, because in a glob, ? means any single character.
x <- c("ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF")
# the (simpler?) "wildcard" way
stringi::stri_detect_regex(x, glob2rx("A?????AF"))
## [1] TRUE FALSE FALSE TRUE FALSE
# the regular expression way (probably WRONG)
stringi::stri_detect_regex(x, "A.{5}AF")
## [1] TRUE TRUE FALSE TRUE FALSE
# the regular expression way (CORRECT)
stringi::stri_detect_regex(x, "^A.{5}AF$")
## [1] TRUE FALSE FALSE TRUE FALSE
This returns a logical vector if the wildcard matches.
By contrast, stri_locate_all_regex() returns a list with one two-column matrix per input string. Each row is a match, and the columns are the starting and ending character positions of that match within the string, or a pair of NA values if the pattern is not found.
Note that one of the differences in your wildcard/glob expression is that to get A + any five characters + AF without any preceding or trailing characters, you would need to specify the regular expression characters for the start and end of the string, as per above. Otherwise the match picks up "XABCDEFAFX" too. For a wildcard/glob, this is not a problem since the start and end of the expression match the beginning and end of the string:
> glob2rx("A?????AF")
[1] "^A.....AF$"
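The anchoring difference can also be checked in base R with grepl(). The sketch below reuses the x vector from this answer; the glob "*A?????AF*" is an added illustration of how a glob would express "anywhere in the string".

```r
x <- c("ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF")

# A glob is implicitly anchored at both ends once converted.
whole_string <- grepl(utils::glob2rx("A?????AF"), x)

# Leading and trailing * in the glob allow surrounding characters.
anywhere_glob <- grepl(utils::glob2rx("*A?????AF*"), x)

# A bare regular expression is unanchored unless ^ and $ are added.
anywhere_regex <- grepl("A.{5}AF", x)

print(rbind(whole_string, anywhere_glob, anywhere_regex))
```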

Substring like matches inside regular expressions?

I have a string on which I need to do a regex match (I'm working in R). It looks like:
"354542676655341568:1373344735:270969722:text1,text2,text4,text8"
This string has 4 parts separated by colons (:). I have multiple strings with different values, but composed of the same 4 parts.
The first numerical part I plan to match using "[0-9]{18}"
For the second part (it is a timestamp), I have a piece of code that generates a regex for a range that I'll append. A sample looks like this:
":0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):"
This above pattern matches for all numbers between 1373300000 & 1373344800.
The Third part also is a plain [0-9]{9}
The problem is the fourth part, where I'll have to match the text part. I'll have a list of text content like text1, text3, text5. I need to accept the string if it has at least one of the texts from the list. It's more like a substring match for the fourth part.
I've thought of splitting the text, but in my application, it would be a poor design with high resource costs. Hence, I'd like to generate one regex that does the entire match together.
I tried a few things to test this out, but I'm getting false positives. Any help available?
checktext = "check:text1,text2,text3"
> grepl("check:[a-zA-Z0-9 ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text2]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:.*[text1].*",checktext)
[1] TRUE
> grepl("check:.*[text2].*",checktext)
[1] TRUE
> grepl("check:.*[text3].*",checktext)
[1] TRUE
> grepl("check:.*[text2|text4].*",checktext)
[1] TRUE
> grepl("check:.*[text5|text4].*",checktext)
[1] TRUE
After @sgibb's reply, I put all the parts together to make the final pattern as:
"[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:[a-zA-Z0-9, ]+,(Samsung|Nokia)"
and my text string was:
"354542676655341568:1373344735:270969722:Samsung,Galaxy"
It didn't match. Is it due to putting all of them together? When I removed the last (text) part from the regex, it matched.
> finalpattern
[1] "[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:"
> keysample
[1] "354542676655341568:1373344735:270969722:Samsung,Galaxy"
> grepl(finalpattern,keysample)
[1] TRUE
IMHO you are using [ incorrectly. A [ ... ] is a character class: it matches exactly one character out of the listed set, so [text5|text4] matches any single t, e, x, 5, | or 4. If you want to group alternative patterns/strings (e.g. text5|text4) you have to use ( ... ):
grepl("check:[a-zA-Z0-9, ]+,(text3|text4)",checktext)
# [1] TRUE
grepl("check:[a-zA-Z0-9, ]+,(text5|text4)",checktext)
# [1] FALSE
This should remove most of your false-positives.
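A short sketch of why the bracketed versions produced false positives: the class matches any one of its characters, while the group matches a whole alternative. The single-character input "x" is a made-up test case.

```r
checktext <- "check:text1,text2,text3"

# Character class: a lone "x" is enough, because x is in the set.
class_hit <- grepl("[text5|text4]", "x")

# Group with alternation: needs the full string text5 or text4.
group_hit <- grepl("(text5|text4)", "x")

# Applied to the original test string
absent  <- grepl("check:.*(text5|text4)", checktext)
present <- grepl("check:.*(text3|text4)", checktext)

print(c(class_hit, group_hit, absent, present))
```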
Addressing your edit:
Your regular expression is wrong (the part after the last :).
[a-zA-Z0-9, ]+,: you look for alphanumeric characters, commas or spaces (BTW see ?regex: classes such as [:alnum:]) occurring at least once and followed by a ,. This will match against Samsung,.
Next you look for (Samsung|Nokia) but there is only Galaxy left.
There are multiple solutions:
"[[:alnum:], ]*(Samsung|Nokia)[[:alnum:], ]*"
"(Samsung|Nokia),[[:alnum:], ]+"
".*(Samsung|Nokia).*"
# ...
Or you should think about splitting your string at : and analyze each part separately.
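If a single pattern is still preferred, one possible way to assemble it is sketched below. The names prefix, allowed, full_pattern and keys are hypothetical, the second and third keys are made-up variations of the sample, and the sketch assumes the allowed texts contain no regex metacharacters. Unlike ".*(Samsung|Nokia).*", it only accepts whole comma-separated items.

```r
# First three parts, taken from the question, anchored at the start
prefix <- paste0(
  "^[0-9]{18}:",
  "0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):",
  "[0-9]{9}:"
)

# Hypothetical list of accepted texts for the fourth part
allowed <- c("Samsung", "Nokia")

# Skip zero or more leading items, then require one allowed item
# that is followed by a comma or the end of the string.
full_pattern <- paste0(
  prefix, "([^,]*,)*(", paste(allowed, collapse = "|"), ")(,|$)"
)

keys <- c(
  "354542676655341568:1373344735:270969722:Samsung,Galaxy",
  "354542676655341568:1373344735:270969722:Galaxy,Nokia",
  "354542676655341568:1373344735:270969722:Galaxy,Note"
)

result <- grepl(full_pattern, keys)
print(result)
```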

R-regex: match strings not beginning with a pattern

I'd like to use regex to see if a string does not begin with a certain pattern. While I can use: [^ to blacklist certain characters, I can't figure out how to blacklist a pattern.
> grepl("^[^abc].+$", "foo")
[1] TRUE
> grepl("^[^abc].+$", "afoo")
[1] FALSE
I'd like to do something like grepl("^[^(abc)].+$", "afoo") and get TRUE, i.e. to match if the string does not start with abc sequence.
Note that I'm aware of this post, and I also tried using perl = TRUE, but with no success:
> grepl("^((?!hede).)*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^((?!hede).)*$", "foohede", perl = TRUE)
[1] FALSE
Any ideas?
Yeah. Put the zero width lookahead /outside/ the other parens. That should give you this:
> grepl("^(?!hede).*$", "hede", perl = TRUE)
[1] FALSE
> grepl("^(?!hede).*$", "foohede", perl = TRUE)
[1] TRUE
which I think is what you want.
Alternately if you want to capture the entire string, ^(?!hede)(.*)$ and ^((?!hede).*)$ are both equivalent and acceptable.
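Applied to the abc prefix from the question, the lookahead can be compared with two simpler base R options: negating an ordinary match, or negating startsWith() (available since R 3.3.0). The input vector is illustrative.

```r
x <- c("foo", "afoo", "abcfoo")

# Negative lookahead placed right after the start anchor
lookahead <- grepl("^(?!abc).*$", x, perl = TRUE)

# Match the prefix normally, then negate the result
negated <- !grepl("^abc", x)

# No regex at all
prefix_test <- !startsWith(x, "abc")

print(rbind(lookahead, negated, prefix_test))
```

The lookahead is mainly useful when the negation has to live inside a larger pattern; otherwise the negated forms are arguably easier to read.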
There is now (years later) another possibility with the stringr package.
library(stringr)
str_detect("dsadsf", "^abc", negate = TRUE)
#> [1] TRUE
str_detect("abcff", "^abc", negate = TRUE)
#> [1] FALSE
Created on 2020-01-13 by the reprex package (v0.3.0)
I got stuck on the following special case, so I thought I would share...
What if there are multiple instances of the regular expression, but you still only want the first segment?
Apparently you can turn off the default greediness of a quantifier with a Perl-style lazy modifier.
Suppose the string I wanted to process was
myExampleString = paste0(c(letters[1:13], "_", letters[14:26], "__",
                           LETTERS[1:13], "_", LETTERS[14:26], "__",
                           "laksjdl", "_", "lakdjlfalsjdf"),
                         collapse = "")
myExampleString
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjdf"
and that I wanted only the first segment before the first "__".
I cannot simply search on "_", because a single underscore is an allowable non-delimiter in this example string.
The following doesn't work. It instead gives me the first and second segments because of the default greediness (but not the third, because the lookahead requires a "__" to follow).
gsub("^(.+(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ"
But this does work
gsub("^(.+?(?=__)).*$", "\\1", myExampleString, perl = TRUE)
"abcdefghijklm_nopqrstuvwxyz"
The difference is the lazy (non-greedy) modifier "?" after the quantified wildcard ".+" in the (Perl) regular expression.
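For comparison, here is a sketch of the lazy version next to two alternatives that avoid lookaheads entirely: deleting everything from the first "__" onward, and splitting on the fixed string. The string is written out literally, and the variable names are illustrative.

```r
s <- "abcdefghijklm_nopqrstuvwxyz__ABCDEFGHIJKLM_NOPQRSTUVWXYZ__laksjdl_lakdjlfalsjdf"

# Lazy quantifier: .+? stops at the first place where "__" follows
lazy <- sub("^(.+?)__.*$", "\\1", s, perl = TRUE)

# Remove from the first "__" to the end (the leftmost match wins)
stripped <- sub("__.*$", "", s)

# Split on the literal delimiter and keep the first piece
first_piece <- strsplit(s, "__", fixed = TRUE)[[1]][1]

print(c(lazy, stripped, first_piece))
```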

Regex to catch all files but those starting with "."

In a directory with mixed content such as:
.afile
.anotherfile
bfile.file
bnotherfile.file
.afolder/
.anotherfolder/
bfolder/
bnotherfolder/
How would you catch everything but the files (not dirs) starting with .?
I have tried with a negative lookahead ^(?!\.).+? but it doesn't seem to work right.
Please note that I would like to avoid doing it by excluding the . by using [a-zA-Z< plus all other possible chars minus the dot >]
Any suggestions?
This should do it:
^[^.].*$
[^abc] will match anything that is not a, b or c
Escaping the . and negating the characters that can start the name, you have:
^[^\.].*$
Tested successfully with your test cases here.
The negative lookahead ^(?!\.).+$ does work. Here it is in Java:
public class Main {
    public static void main(String[] args) {
        String[] files = {
            ".afile",
            ".anotherfile",
            "bfile.file",
            "bnotherfile.file",
            ".afolder/",
            ".anotherfolder/",
            "bfolder/",
            "bnotherfolder/",
            "",
        };
        for (String file : files) {
            // Columns: name, regex result, startsWith result
            System.out.printf("%-18s %6b%6b%n", file,
                file.matches("^(?!\\.).+$"),
                !file.startsWith(".")
            );
        }
    }
}
The output is (as seen on ideone.com):
.afile              false false
.anotherfile        false false
bfile.file           true  true
bnotherfile.file     true  true
.afolder/           false false
.anotherfolder/     false false
bfolder/             true  true
bnotherfolder/       true  true
                    false  true
Note also the use of the non-regex String.startsWith. Arguably this is the best, most readable solution, because regex is not needed anyway, and startsWith is O(1) whereas the regex (at least in Java) is O(N).
Note the disagreement on the blank string. If this is a possible input, and you want this to return false, you can write something like this:
!file.isEmpty() && !file.startsWith(".")
See also
Is regex too slow? Real life examples where simple non-regex alternative is better
In Java, .* even in Pattern.DOTALL mode takes O(N) to match.
Uhm... how about a negative character class?
[^.]
to exclude the dot?
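None of the patterns above keeps the dot-directories, which the question asked for. As a rough sketch in R, assuming (as in the listing) that directories are marked by a trailing slash, a lookahead can exclude only dot-names that contain no slash. The variable names are illustrative.

```r
files <- c(".afile", ".anotherfile", "bfile.file", "bnotherfile.file",
           ".afolder/", ".anotherfolder/", "bfolder/", "bnotherfolder/")

# Negated character class: drops every name that starts with a dot
no_dot <- grepl("^[^.]", files)

# Drop only dot-names with no slash, i.e. dot-files but not dot-directories
keep_dirs <- grepl("^(?!\\.[^/]*$).+$", files, perl = TRUE)

# The same rule without a regex
keep_dirs_plain <- !(startsWith(files, ".") & !endsWith(files, "/"))

print(files[keep_dirs])
```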