regexp to include underscore

I'm new to using regexp. I have the following:
files = c("apple_2014_10_17.csv",
"apple_2014_10_18.csv",
"applepie_2014_10_17.csv",
"applepie_2014_10_18.csv")
I am looking to return only:
apple_2014_10_17.csv
apple_2014_10_18.csv
and NOT return:
applepie_2014_10_17.csv
applepie_2014_10_18.csv
I'm using the following regexp
grepl("apple_*", files)
But it returns all the files. Any assistance would be greatly appreciated.

You could simply remove the * quantifier. The problem is that this quantifier means "zero or more" times, so _* also matches zero underscores and the pattern matches apple in every vector element, whether or not an underscore follows.
files[grepl('apple_', files)]
# [1] "apple_2014_10_17.csv" "apple_2014_10_18.csv"
Or you could retain the quantifier and just place a dot . in front of it. This way apple_ is matched literally and then the preceding token (. any single character) is matched "zero or more" times instead.
files[grepl('apple_.*', files)]
# [1] "apple_2014_10_17.csv" "apple_2014_10_18.csv"
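The same behaviour can be reproduced outside R. Below is a minimal Python 3 sketch (Python's re module is used purely for illustration, and the variable names loose and strict are made up for this example) showing why apple_* matches every file name while apple_ does not:

```python
import re

files = [
    "apple_2014_10_17.csv",
    "apple_2014_10_18.csv",
    "applepie_2014_10_17.csv",
    "applepie_2014_10_18.csv",
]

# "_*" means "zero or more underscores", so a bare "apple" is enough to match.
loose = [f for f in files if re.search(r"apple_*", f)]

# Without the quantifier the underscore is mandatory.
strict = [f for f in files if re.search(r"apple_", f)]

print(len(loose))  # every file name matches
print(strict)      # only the names where "_" directly follows "apple"
```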

You can also use the value argument in grep and not have to subset files. The fixed argument of grep (and grepl) makes the matching literal, and since this bypasses the regex engine it is often faster.
grep("apple_", files, value = TRUE, fixed = TRUE)
# [1] "apple_2014_10_17.csv" "apple_2014_10_18.csv"
Or easier might be to use the invert argument and search for "pie", returning the opposite matches.
grep("pie", files, value = TRUE, invert = TRUE)
# [1] "apple_2014_10_17.csv" "apple_2014_10_18.csv"
Note that if you're searching for files in a directory, you can also try
list.files(pattern = "apple_")
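For comparison, here is a rough Python 3 analogue of the fixed-string and inverted searches above. It is only a sketch: plain substring tests stand in for fixed = TRUE and invert = TRUE, and the names kept and without_pie are invented for the example.

```python
files = [
    "apple_2014_10_17.csv",
    "apple_2014_10_18.csv",
    "applepie_2014_10_17.csv",
    "applepie_2014_10_18.csv",
]

# Literal substring test: no regex engine involved.
kept = [f for f in files if "apple_" in f]

# Inverted test: keep everything that does NOT contain "pie".
without_pie = [f for f in files if "pie" not in f]

print(kept)
print(without_pie)
```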

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ anchors the match from the start to the end of the string
[...] matches one of the given characters
...{3} repeats the preceding token three times
(...)* repeats the group in parentheses zero or more times
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
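To see the difference concretely, here is a small Python 3 sketch that runs the question's pattern and the end-anchored pattern side by side. The constant names UNANCHORED and ANCHORED and the sample list are made up for illustration.

```python
import re

# Pattern from the question: only requires one "triple + comma" at the start.
UNANCHORED = re.compile(r"([abc][abc][abc],)+")

# End-anchored pattern: comma-terminated triples, a final bare triple, then end of string.
ANCHORED = re.compile(r"([abc][abc][abc],)*([abc][abc][abc])$")

samples = ["abc,bca,cbb", "bcb", "abc,defx,df", "abc,bca,"]

for s in samples:
    # Prints the sample, then whether each pattern accepts it.
    print(s, bool(UNANCHORED.match(s)), bool(ANCHORED.match(s)))
```

The unanchored pattern accepts "abc,defx,df" because it stops caring after the first triple, and it rejects the valid single triple "bcb" because it insists on a comma.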
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
import re
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print match.group('value')
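So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# the trailing $ is needed; without it the prefix "abc,bca," alone would count as a match
# note: a trailing comma, as in "abc,", is still accepted by this pattern
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print "data string is correct"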
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters of [abc] followed by a comma, one or more times, at the start of the string (re.match only anchors the beginning). So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
import re
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print ss,' ',bool(re.match(r'([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
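One practical difference between $ and \Z in Python is worth a quick illustration: $ also matches just before a newline at the very end of the string, while \Z matches only at the true end. A minimal Python 3 sketch (the variable names are invented for the example):

```python
import re

line = "abc,bca\n"  # e.g. a line read from a file, newline still attached

with_dollar = re.match(r"([abc]{3},)*[abc]{3}$", line)
with_z = re.match(r"([abc]{3},)*[abc]{3}\Z", line)

print(bool(with_dollar))  # "$" tolerates the single trailing newline
print(bool(with_z))       # "\Z" insists on the very end of the string
```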
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups ("To actually get the groups, use str.extract."), and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
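The re.findall effect mentioned above is easy to reproduce. A short Python 3 sketch, with made-up variable names, comparing a capturing and a non-capturing group on the same input:

```python
import re

text = "abc,bca,cbb"

# With a capturing group, findall returns only what the group captured
# (for a repeated group, that is its last iteration).
capturing = re.findall(r"[abc]{3}(,[abc]{3})*", text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"[abc]{3}(?:,[abc]{3})*", text)

print(capturing)
print(non_capturing)
```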

Find repeated words in a string separated by "/"

Assume the following vector:
x <- c("/default/img/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/IRS.html/", "something/repeat/repeat_this")
I want to check whether a word enclosed by / is repeated (Note that / might be missing from start and end of string). I found the following brilliant piece of regex here but (after I strip special characters) I can't seem to modify it to fit my case:
grepl("\\b(\\S+?)\\1\\S*\\b", x, perl = TRUE)
# [1] TRUE TRUE
I can always str_split(x, "/") and iterate the duplicated() function over the list and use an if() statement but that would be terribly inefficient.
Desired outcome should be a vector with TRUE or FALSE (or 1 and 0).
Another solution, if you only want to check for the pattern:
grepl(x, pattern = "((.+)/).*(/\\2(/|$))", perl=T)
where (.+) represents the word itself (capture group 2) appearing before a slash, and the .* allows an arbitrary run of characters between the two equal substrings. (/\\2(/|$)) then matches if the word occurs again after a slash, followed by either another slash or the end of the string ($).
For extraction you can use strsplit() as elaborated above.
I think the following could work for you. First, fixed = TRUE in strsplit() bypasses the regex engine and goes straight to exact matching, making the function much faster. Next, anyDuplicated() returns a length one integer result which will be zero if no duplicates are found, and greater than zero otherwise. So we can split the string with strsplit() and iterate anyDuplicated() over the result. Then we can compare the resulting vector with zero.
vapply(strsplit(x, "/", fixed = TRUE), anyDuplicated, 1L) > 0L
# [1] TRUE FALSE
To be safe, you may want to remove any leading /, since it will produce an empty character in the result from strsplit() and could produce misleading results in some cases (e.g. cases where the string begins with a / and irs//irs or similar occurs later in the string). You can remove leading forward slashes with sub("^/", "", x).
In summary, the ways to make your strsplit() idea faster are:
use fixed = TRUE in strsplit() to bypass the regex engine
use anyDuplicated() since it stops looking after it finds one match
use vapply() since we know what the result type and length will be
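For readers more at home in Python, the same split-then-look-for-duplicates idea can be sketched as follows. This is not a translation of the R code: the function name has_repeated_segment is hypothetical, and, as an assumption of this sketch, all empty segments (from leading, trailing or doubled slashes) are ignored.

```python
def has_repeated_segment(path):
    """Return True if any non-empty '/'-separated segment occurs more than once."""
    seen = set()
    for segment in path.split("/"):
        if not segment:
            continue  # skip empty pieces produced by leading/trailing slashes
        if segment in seen:
            return True  # stop at the first duplicate, like anyDuplicated()
        seen.add(segment)
    return False


x = [
    "/default/img/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/irs/IRS.html/",
    "something/repeat/repeat_this",
]
print([has_repeated_segment(p) for p in x])
```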

R Wildcard matching for certain number of terms

Suppose I have a string and am searching for particular wildcard terms. For example:
x <- "AJSDKLAFJASFJABJKADL"
z <- stri_locate_all_regex(x, 'A*****AF')
I want to search for all terms that have any 5 characters in between A and AF, like ABJDKAAF or AJSDKLAF... However the above code does not work. Is there a simple way to do this that I am overlooking? Thank you!
In regular expressions (as opposed to standard wildcards that you might be used to), * means "0 or more of the preceding character", so "A*" means "0 or more A". You can't stack them like '****'; for that you want '.', which means "any one character".
z <- stri_locate_all_regex(x, 'A.....AF')
TL;DR: regex problem, not R problem.
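Since this is a regex question rather than an R one, the corrected pattern can be checked in any regex engine. Here is a minimal Python 3 sketch that locates the match in the question's string. Note that Python spans are zero-based and end-exclusive, so the numbers will not necessarily line up with what a locate function in another library reports.

```python
import re

x = "AJSDKLAFJASFJABJKADL"

# "A", then exactly five arbitrary characters, then "AF".
spans = [m.span() for m in re.finditer(r"A.{5}AF", x)]
matches = [x[start:end] for start, end in spans]

print(spans)
print(matches)
```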
For a simple way to do this, and by this I assume you mean that you want to use your wildcard characters as in the question, you can turn these into proper regular expressions using glob2rx(). A "wildcard" expression, also known as a "glob", is a sort of poor man's regular expression (?regex). For your expression, you can specify five ? characters, because in a glob, ? means any single character.
x <- c("ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF")
# the (simpler?) "wildcard" way
stringi::stri_detect_regex(x, glob2rx("A?????AF"))
## [1] TRUE FALSE FALSE TRUE FALSE
# the regular expression way (probably WRONG)
stringi::stri_detect_regex(x, "A.{5}AF")
## [1] TRUE TRUE FALSE TRUE FALSE
# the regular expression way (CORRECT)
stringi::stri_detect_regex(x, "^A.{5}AF$")
## [1] TRUE FALSE FALSE TRUE FALSE
This returns a logical vector if the wildcard matches.
By contrast, stri_locate_all_regex() returns a list of two-column matrices, one per input string, in which each row gives the starting and ending character positions of a match, or a pair of NA values if the pattern is not found.
Note that one of the differences in your wildcard/glob expression is that to get A + any five characters + AF without any preceding or trailing characters, you would need to specify the regular expression characters for the start and end of the string, as per above. Otherwise the match picks up "XABCDEFAFX" too. For a wildcard/glob, this is not a problem since the start and end of the expression match the beginning and end of the string:
> glob2rx("A?????AF")
[1] "^A.....AF$"
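The anchoring difference between globs and regular expressions is not specific to R. As a rough Python 3 illustration (using the standard fnmatch module in place of glob2rx(), with invented variable names), the three approaches give the same three result vectors as above:

```python
import fnmatch
import re

x = ["ABCDEFAF", "XABCDEFAFX", "abcdeaf", "A55555AF", "A666666AF"]

# Glob matching is implicitly anchored to the whole string.
glob_hits = [fnmatch.fnmatchcase(s, "A?????AF") for s in x]

# An unanchored regex search also finds the pattern inside a longer string.
unanchored_hits = [bool(re.search(r"A.{5}AF", s)) for s in x]

# re.fullmatch behaves like wrapping the pattern in ^...$.
anchored_hits = [bool(re.fullmatch(r"A.{5}AF", s)) for s in x]

print(glob_hits)
print(unanchored_hits)
print(anchored_hits)
```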

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in R) :-)
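The effect of the non-greedy quantifier can be checked with a short Python 3 sketch. The constants GREEDY and LAZY and the helper root() are hypothetical names for this example, and re.fullmatch is used in place of explicit ^ and $ anchors.

```python
import re

SUFFIX = r"(?:[FGHJKMNQUVXZ][0-9]{1,2})?"  # optional expiration code

GREEDY = re.compile(r"([A-Z0-9]+)" + SUFFIX)
LAZY = re.compile(r"([A-Z0-9]+?)" + SUFFIX)


def root(pattern, symbol):
    """Return the first captured group if the whole symbol matches, else None."""
    m = pattern.fullmatch(symbol)
    return m.group(1) if m else None


for symbol in ("ZNH3", "ZN", "CLZ4"):
    # The greedy root swallows the expiration code; the lazy root leaves it to the suffix.
    print(symbol, root(GREEDY, symbol), root(LAZY, symbol))
```

The lazy root only grows as far as it must for the rest of the pattern to reach the end of the string, which is why "ZN" on its own still comes back as "ZN".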

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be at least one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".*?" with "[^(\?=)]*?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
A good practice in my experience is to avoid .*? and instead use a plain * (without the ?) on a more refined character class. In this case, [^?]* matches a sequence of non-question-mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Lazily match as few characters as possible, until the rest of the pattern can match. Here that is the next '?='.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Lazily match as few characters as possible, until the rest of the pattern can match. Here that is the next '?='.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?', followed by a '?'. Repeated twice to skip the 'utf-8?Q?' prefix.
([^?]*) # Capture the following characters up to the next '?'.
/g # Repeat the match until the end of the string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression is that the first group is now non-greedy: ".+" became ".*?". Critical, and I'd forgotten it.
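For anyone who wants to see the three behaviours side by side, here is a small Python 3 sketch. It uses Python's re.findall rather than the engine from the question, and the variable names are invented for illustration. It compares the original greedy expression, the lazy one, and the negated-character-class variant suggested in the answers.

```python
import re

line = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="

# Original expression: the greedy (.+) runs to the last "?x?" it can find,
# so the whole line collapses into a single match.
greedy = re.findall(r"=\?(.+)\?.\?(.*?)\?=", line)

# Lazy first group: one match per =?...?= pair.
lazy = re.findall(r"=\?(.*?)\?.\?(.*?)\?=", line)

# Negated character class: skip two "?"-terminated fields, capture the third.
negated = re.findall(r"=\?(?:[^?]*\?){2}([^?]*)", line)

print(greedy)
print(lazy)
print(negated)
```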