Regex in R to limit one term AND another not OR another

I am trying to extract records from a data.frame using grepl. Here are some example cases.
a <- c('This is a healthcare facility', 'this is a hospital', 'this is a hospital district', 'this is a district health service')
I wish to extract all records that contain hospital but not district. I have come unstuck when district and hospital occur in the same string. I tried the following:
str_match(string=a,pattern='hospital|^district' )
How do I limit district but still include hospital in this example?
Thanks.

You need to use the symbol & for AND, ! for NOT, with two grepl calls:
grepl("hospital", a) & !grepl("district", a)
# [1] FALSE TRUE FALSE FALSE
a[.Last.value]
# [1] "this is a hospital"

You could use two calls to grepl:
a[grepl("hospital", a) & !grepl("district", a)]
# [1] "this is a hospital"

R supports Perl-compatible regular expressions, which allow negative lookahead assertions, so in principle, you can write:
grepl('^(?!.*district).*hospital', a, perl=TRUE)
(which matches "start-of-string, at a point in the string that is not followed by .*district, followed by .*hospital"). Note that perl=TRUE is an argument of the base functions such as grepl; stringr's str_match does not accept it. That said, I'm really not sure if putting this condition into a single regex is the best way to do it; there may be a more R-ish way.
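Since the question is about extracting records from a data.frame, here is a minimal sketch of both approaches applied to one. The data frame df and its columns id and name are invented for illustration; only the four strings come from the question.

```r
# Hypothetical data frame; only the four strings come from the question.
df <- data.frame(
  id = 1:4,
  name = c("This is a healthcare facility",
           "this is a hospital",
           "this is a hospital district",
           "this is a district health service"),
  stringsAsFactors = FALSE
)

# Approach 1: two grepl calls combined with & (AND) and ! (NOT).
keep_two_calls <- grepl("hospital", df$name) & !grepl("district", df$name)

# Approach 2: a single PCRE pattern with a negative lookahead.
keep_lookahead <- grepl("^(?!.*district).*hospital", df$name, perl = TRUE)

# Both logical vectors select the same rows.
stopifnot(identical(keep_two_calls, keep_lookahead))

result <- df[keep_two_calls, ]
print(result)
```

The two-call version is probably easier to read and to extend with further conditions; the lookahead version may be handy when only a single pattern can be passed to some other function.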

Related

agrepl does not work with regular expressions

I need to get partial matching in a string using regexes. I can get exact ones:
pattern <- "(^| )shower only($| )"
stringInQuestion<-"Delta Vero 1-Handle Shower Only Faucet Trim Kit in Chrome"
grepl(pattern,stringInQuestion, ignore.case=TRUE,perl=TRUE)
[1] TRUE
agrepl(pattern,stringInQuestion, ignore.case=TRUE,fixed = FALSE, max.distance=0.2)
[1] FALSE
agrepl works only for plain character strings:
agrepl("shower only",stringInQuestion, ignore.case=TRUE,fixed = FALSE, max.distance=0.2)
Can somebody please help me to figure out what is going on?

Since you intend to just check for a whole word presence, I suggest reducing the pattern to
pattern <- "\\bshower only\\b"
See the official description of the max.distance argument:
max.distance
Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost (will be replaced by the smallest integer not less than the corresponding fraction), or a list with possible components
0.2 will allow matching the phrase with errors, say Showerrrrr Only, but won't match Showerrrrrr Only. See this working demo:
pattern <- "\\bshower only\\b"
stringInQuestion<-"Delta Vero 1-Handle Shower Only Faucet Trim Kit in Chrome"
grepl(pattern,stringInQuestion, ignore.case=TRUE,perl=TRUE)
agrepl(pattern,stringInQuestion, ignore.case=TRUE,fixed = FALSE, max.distance=0.2)
## [1] TRUE
## [1] TRUE
Note that max.distance should be tested against the real input you have and adjusted accordingly.

How to add an exception to a search query in R using grepl

I have a variable consisting of different group names, which are responsible for terrorist incidents (observations).
I would like to exclude all observations, where this variable includes the word "Communist" e.g. exclude all cases where groupname = "Bangladesh Communist Party" etc. Here is my code for doing this:
newdata <- olddata[!grepl("Communist", olddata$groupname),]
But I want to add an exception to this rule: all the "Anti-Communist" groups should remain in the data frame. So the code should remove "Bangladesh Communist Party" but leave e.g. "Anti-Communist Rebels".
Do I use regular expressions? Or is there a way to add an exception to this kind of pattern matching? I guess it should look something like this at the end:
newdata <- olddata[!grepl("Communist"[but exclude "Anti-Communist"], olddata$groupname),]
Thanks!

You can use a negative look behind:
x <- c("Bangladesh Communist Party", "Anti-Communist Rebels")
!grepl("(?<!Anti-)Communist", x, perl = TRUE)
# [1] FALSE TRUE
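Applied to the data frame from the question, the lookbehind could be used roughly as follows. This is a sketch under assumptions: olddata is rebuilt here with made-up rows, and "Green Brigade" is an invented group name.

```r
# Hypothetical data; "Green Brigade" is an invented name for illustration.
olddata <- data.frame(
  groupname = c("Bangladesh Communist Party",
                "Anti-Communist Rebels",
                "Green Brigade"),
  stringsAsFactors = FALSE
)

# TRUE where "Communist" occurs without "Anti-" immediately before it.
is_communist <- grepl("(?<!Anti-)Communist", olddata$groupname, perl = TRUE)

# drop = FALSE keeps the result a data frame even with a single column.
newdata <- olddata[!is_communist, , drop = FALSE]
print(newdata$groupname)
```

Note that a name containing both "Anti-Communist" and a separate plain "Communist" would still be removed, because the second occurrence is not preceded by "Anti-".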

How to split array of strings from two sides?

I have an array of strings (n=1000) in this format:
strings<-c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz",
"GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz",
"GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
I'm wondering what may be an easy way to get this:
strings2<-c("2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL",
"2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL",
"2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL")
which means to trim off "GSM1234567" from the front and ".gz" from the end.

A gsub solution: the pattern matches either a run of zero or more (*) alphanumeric characters at the start of the string (^), up to and including the first _, or .gz at the end of the string ($), and replaces every match with an empty string.
gsub("^([[:alnum:]]*_)|(\\.gz)$", "", strings)
[1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
[2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
[3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"
Edit: I had forgotten to escape the dot in .gz.

strings <- c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz", "GSM1264937_2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL.gz", "GSM1264938_2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL.gz")
# substr() is vectorized, so lapply() is not needed; this relies on every string having the same length
strings2 <- substr(strings, 12, 58)

You can do this using sub:
sub('[^_]+_(.*)\\.gz', '\\1', strings)
# [1] "2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL"
# [2] "2202_4866_28369_150cGy-GCSF6-m4_Mouse430A+2.CEL"
# [3] "2202_4866_28370_150cGy-GCSF6-m5_Mouse430A+2.CEL"

Try:
gsub('^[^_]+_|\\.[^.]*$','',strings)

I strongly suggest doing this in two steps. The other solutions work but are completely unreadable: they don’t express the intent of your code. Here it is, clearly expressed:
trimmed_prefix = sub('^GSM\\d+_', '', strings)
strings2 = sub('\\.gz$', '', trimmed_prefix)
But admittedly this can be expressed in one step, and wouldn’t look too bad, as follows:
strings2 = sub('^GSM\\d+_(.*)\\.gz$', '\\1', strings)
In general, think carefully about the patterns you actually want to match: your question says to match the prefix “GSM1234567” but your example contradicts that. I’d generally choose a pattern that’s as specific as possible to avoid accidentally matching faulty input.
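To illustrate the point about specific patterns, here is a small sketch comparing a generic pattern with the specific one on a malformed input. The second element, "README_notes.txt", is an invented stray file name and not part of the question.

```r
# The first string is from the question; the second is an invented stray file name.
strings <- c("GSM1264936_2202_4866_28368_150cGy-GCSF6-m3_Mouse430A+2.CEL.gz",
             "README_notes.txt")

# Generic pattern: strips anything up to the first "_" and any final extension.
generic <- gsub("^[^_]+_|\\.[^.]*$", "", strings)

# Specific pattern: only rewrites strings shaped like GSM<digits>_....gz.
specific <- sub("^GSM\\d+_(.*)\\.gz$", "\\1", strings)

# Inputs the specific pattern did not recognise come back unchanged,
# so they can be detected and reported.
unrecognised <- strings[specific == strings]

print(generic)
print(specific)
print(unrecognised)
```

The generic pattern silently turns the stray name into "notes", whereas the specific pattern leaves it untouched, which makes faulty input easier to spot.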

Extract a string of words between two specific words in R [duplicate]

I have the following string : "PRODUCT colgate good but not goodOKAY"
I want to extract all the words between PRODUCT and OKAY

This can be done with sub:
s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)
giving:
[1] "colgate good but not good"
No packages are needed.

x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")
(?<=PRODUCT) -- look behind the match for PRODUCT
.* match everything except new lines.
(?=OKAY) -- look ahead to match OKAY.
I should add you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting etc., the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.
(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
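For reference, the same lookbehind/lookahead pattern can also be used with base R only. This is a minimal sketch using regexpr and regmatches; the trimws call is an addition here, since the raw match keeps the space that follows PRODUCT.

```r
x <- "PRODUCT colgate good but not goodOKAY"

# regexpr() finds the first match; perl = TRUE enables lookbehind/lookahead.
m <- regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE)

# regmatches() pulls out the matched text, which still has the leading space.
between <- regmatches(x, m)

# trimws() removes the surrounding whitespace.
result <- trimws(between)
print(result)
```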

You can use gsub:
vec <- "PRODUCT colgate good but not goodOKAY"
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"

You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:
x <- "PRODUCT colgate good but not goodOKAY"
library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
## [[1]]
## [1] "colgate good but not good"

You could use the package unglue :
library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"

Regular Expressions for City name

I need a regular Expression for Validating City textBox, the city textbox field accepts only Letters, spaces and dashes(-).

This answer assumes that the letters @Manaysah refers to also include letters with diacritical marks. I've added the single quote ' since many names in Canada and France have it. I've also added the period (dot) since it's required for contracted names.
Building upon @UIDs's answer I came up with,
^([a-zA-Z\u0080-\u024F]+(?:\. |-| |'))*[a-zA-Z\u0080-\u024F]*$
The list of cities it accepts:
Toronto
St. Catharines
San Francisco
Val-d'Or
Presqu'ile
Niagara on the Lake
Niagara-on-the-Lake
München
toronto
toRonTo
villes du Québec
Provence-Alpes-Côte d'Azur
Île-de-France
Kópavogur
Garðabær
Sauðárkrókur
Þorlákshöfn
And what it rejects:
A----B
------
*******
&&
()
//
\\
I didn't add in the use of brackets and other marks since it didn't fall within the scope of this question.
I've stayed away from \s for whitespace. Tabs and line feeds aren't part of a city name and shouldn't be used in my opinion.

This can be arbitrarily complex, depending on how precise you need the match to be, and the variation you're willing to allow.
Something fairly simple like ^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$ should work.
Warning: This does not match cities like München, etc., but here you basically need to work with the [a-zA-Z] part of the expression, and define what characters are allowed for your particular case.
Keep in mind that [\s-] matches exactly one separator, so something like San----Francisco, or a name with several consecutive spaces, is rejected; \s also accepts tabs and line feeds, not just spaces.
Translates to something like:
1 or more letters, followed by a block of: one whitespace character or dash and then 1 or more letters; this block can occur 0 or more times.
Weird stuff in there: the ?: bit. If you're not familiar with regexes, it might be confusing, but it simply states that the piece of regex between the parentheses is not a capturing group (I don't want to capture the part it matches to reuse later), so the parentheses are only used to group the expression (and not to capture the match).
"New York" // passes
"San-Francisco" // passes
"San Fran Cisco" // passes (sorry, needed an example with three tokens)
"Chicago" // passes
" Chicago" // doesn't pass, starts with spaces
"San-" // doesn't pass, ends with a dash

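The pass/fail cases for ^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$ can be checked quickly in R. This is only a sketch: the candidates vector collects the examples from the answer, and perl = TRUE is used so that \s and (?: ) are interpreted by PCRE.

```r
# Backslashes are doubled because this is an R string literal.
city_re <- "^[a-zA-Z]+(?:[\\s-][a-zA-Z]+)*$"

candidates <- c("New York", "San-Francisco", "San Fran Cisco", "Chicago",
                " Chicago", "San-", "San----Francisco")

ok <- grepl(city_re, candidates, perl = TRUE)
print(data.frame(candidates, ok))
```

The first four candidates match; the last three do not, because the name must start with a letter, end with a letter, and have exactly one separator between words.
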
Adding my answer in case anybody needs it while searching for a regex for city names, like I did.
Please use this:
^[a-zA-Z\u0080-\u024F\s\/\-\)\(\`\.\"\']+$
Many city names contain dashes, such as Soddy-Daisy, Tennessee, or special characters such as the ñ in La Cañada Flintridge, California.
Hope this helps!

Here is the one I've found works best
for PCRE flavours allowing \p{L} (.NET, php, Golang)
/^\p{L}+(?:([\ \-\']|(\.\ ))\p{L}+)*$/u
for regex that does not allow \p{L} replace it with [a-zA-Z\u0080-\u024F]
so for javascript, python regex use
/^[a-zA-Z\u0080-\u024F]+(?:([\ \-\']|(\.\ ))[a-zA-Z\u0080-\u024F]+)*$/
Whitelisting a bunch of characters is easy, but there are things to watch for in your regex:
consecutive non-alphabetical characters should not be allowed, e.g. "Los  Angeles" should fail because it has two spaces
a period should be followed by a space, e.g. St.Albert should fail because the space is missing
names cannot start or end with non-alphabetical characters, e.g. -Chicago- should fail
the whitespace class \s is not the same as an escaped space (\ ): with \s a tab or line feed character could pass, so a literal space should be used instead
Note: When building regex rules, I find https://regex101.com/tests is very helpful, as you can easily create unit tests
js: https://regex101.com/r/cgJwc0/1/tests
php: https://regex101.com/r/Yo3GV2/1/tests
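The rules listed above can be exercised with a few test strings in R. This is a sketch under a stated assumption: the letter class is simplified to ASCII [a-zA-Z] so the example does not depend on locale or encoding; it would need \p{L} or a wider range to accept accented letters.

```r
# ASCII-only simplification (an assumption): swap [a-zA-Z] for \\p{L} or a
# wider range if accented letters must be accepted.
city_re <- "^[a-zA-Z]+(?:(?:[ '-]|\\. )[a-zA-Z]+)*$"

cases <- c("Los Angeles", "Los  Angeles", "St. Albert", "St.Albert",
           "-Chicago-", "San\tJose", "Val-d'Or")

ok <- grepl(city_re, cases, perl = TRUE)
names(ok) <- cases
print(ok)
```

Only "Los Angeles", "St. Albert" and "Val-d'Or" pass: the double space, the period without a following space, the leading and trailing dashes, and the tab are all rejected.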

Here's one that will work with most cities, and has been tested:
^[a-zA-Z\u0080-\u024F]+(?:\. |-| |')*([1-9a-zA-Z\u0080-\u024F]+(?:\. |-| |'))*[a-zA-Z\u0080-\u024F]*$
Python code below, including its test.
import re

import pytest

CITY_RE = re.compile(
    r"^[a-zA-Z\u0080-\u024F]+(?:\. |-| |')*"  # first word and its separators
    r"([1-9a-zA-Z\u0080-\u024F]+(?:\. |-| |'))*"  # further words, digits allowed
    r"[a-zA-Z\u0080-\u024F]*$"  # optional final word
)


def is_city(value: str) -> bool:
    return CITY_RE.match(value) is not None


# Tests
@pytest.mark.parametrize(
    "value,expected",
    (
        ("1", False),
        ("Toronto", True),
        ("Saint-Père-en-Retz", True),
        ("Saint Père en Retz", True),
        ("Saint-Père en Retz", True),
        ("Paris 13e Arrondissement", True),
        ("Paris 13e Arrondissement ", True),
        ("Bouc-Étourdi", True),
        ("Arnac-la-Poste", True),
        ("Bourré", True),
        ("Å", True),
        ("San Francisco", True),
    ),
)
def test_is_city(value, expected):
    assert is_city(value) is expected

^[a-zA-Z\- ]+$
Also this might be useful http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

use this regex:
^[a-zA-Z-\s]+$

After many hours of looking for a city regex matcher I have built this and it meets my needs 100%
(?ix)^[A-Z.-]+(?:\s+[A-Z.-]+)*$
expression for testing city.
Matches
City
St. City
Some Silly-City
City St.
Too Many Words City
It seems that there are many flavors of regex; I built this for my Java needs and it works great.

^[a-zA-Z.-]+(?:[\s-][\/a-zA-Z.]+)*$
This will help identify some city names like St. Johns, Baie-Sainte-Anne, Grand-Salut/Grand Falls

I like shpeley's suggestion, but it has a couple of flaws in it.
If you change shpeley's regex to this, it will not accept other special characters:
^([a-zA-Z\u0080-\u024F]{1}[a-zA-Z\u0080-\u024F\. |\-| |']*[a-zA-Z\u0080-\u024F\.']{1})$

I use that one:
^[a-zA-Z\\u0080-\\u024F.]+((?:[ -.|'])[a-zA-Z\\u0080-\\u024F]+)*$

You can try this:
^\p{L}+(?:[\s\-]\p{L}+)*$
The above regex will:
Restrict leading and trailing spaces, hyphens
Match cities with names like Néewiller-près-lauterbourg

Here are some fun edge-cases:
's Graveland
's Gravendeel
's Gravenpolder
's Gravenzande
's Heer Arendskerke
's Heerenberg
's Heerenhoek
's Hertogenbosch
't Harde
't Veld
't Zand
100 Mile House
6 October City
So, don't forget to add ' and 0-9 as a possible first character of the city name.
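As a rough illustration of how such names might be accommodated, here is a sketch in R. The pattern is hypothetical and not taken from any answer on this page: it allows an optional prefix such as 's or 't followed by a space, and permits digits inside words, using ASCII letters only.

```r
# Hypothetical pattern, not taken from any answer above: an optional
# "'s " / "'t " style prefix, then words of ASCII letters or digits.
city_re <- "^(?:'[a-z] )?[a-zA-Z0-9]+(?:[ '-][a-zA-Z0-9]+)*$"

edge_cases <- c("'s Graveland", "'t Harde", "100 Mile House",
                "6 October City", "-Chicago")

ok <- grepl(city_re, edge_cases, perl = TRUE)
names(ok) <- edge_cases
print(ok)
```

The first four names are accepted and "-Chicago" is still rejected. Allowing digits everywhere also lets through purely numeric input such as "123", so whether this trade-off is acceptable depends on the data.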
