Regex to reject only nonalphanumeric characters

The keyword to check is "other". It must not be immediately preceded or followed by an alphanumeric character.
Spaces, \n, and special characters around it are allowed.
Not allowed: "AOther9", "noTHERX"
Allowed: "other", "\nother", " other ", "$other/"
grepl(paste("[^a-zA-Z0-9]","other","[^a-zA-Z0-9]",sep=""),String1 , ignore.case = TRUE)
The above regex works for all cases except one: when the keyword is preceded and followed by nothing (the bare string "other").

You need to use a PCRE regex with lookarounds:
grepl(paste("(?<![a-zA-Z0-9])","other","(?![a-zA-Z0-9])",sep=""), String1, ignore.case = TRUE, perl=TRUE)
Negative lookarounds do not consume the neighboring non-alphanumeric characters, and they do not require those characters to be present at all, so the keyword also matches at the very start or end of the string.
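As a rough illustration of the same lookaround idea outside R, here is a minimal Python sketch. The helper name `has_keyword` and the sample list are assumptions made for this example; Python's `re` module accepts the same `(?<!...)` and `(?!...)` syntax.

```python
import re

def has_keyword(text, keyword="other"):
    """Return True if keyword occurs with no alphanumeric neighbor on either side."""
    # Hypothetical helper: builds the same lookaround pattern as the R call above.
    pattern = r"(?<![a-zA-Z0-9])" + re.escape(keyword) + r"(?![a-zA-Z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

samples = ["AOther9", "noTHERX", "other", "\nother", " other ", "$other/"]
for s in samples:
    print(repr(s), has_keyword(s))
```

The first two samples are rejected and the other four are accepted, including the bare "other" that the original character-class version missed.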

Add a * quantifier to the negated character classes, and anchor the pattern with ^ (start of line) and $ (end of line):
String1 <- c("AOther9", "noTHERX", "other", "\nother", " other ", "$other/")
grep('^[^a-z0-9]*other[^a-z0-9]*$', String1, ignore.case = TRUE, value = TRUE)
# [1] "other" "\nother" " other " "$other/"


A regular expression to replace different combinations of double quotes inside a double-quoted string

I can't clean this JSON with a single regular expression (PCRE), and I don't know what to do next.
("title":")[\s\S]+(", "partid":)
I've tried various search and replacement options. For example, I search for ("title":"[^"])(")([^"])(")(, "p) and replace it with $1$3$4$5, then do the same for two double quotes, for three, and so on.
Examples of strings:
{ "DT_RowId":"c2a839fb-580a-11e8-bac6-00155d080416", "title":"Гайка 7/16"-14" UNC топорна;14H813;P88344 12"", "partid":"S.4964", "manufacturerid":"2a7dc482-af13-11de-88d3-00e081b05e17", "manufacturer":"SPAREX", "quantity":">10", "price":"8.93", "actionprice":"", "rep":1, "img":0 } ,
{ "DT_RowId":"05d8b40c-ec93-11dd-8f72-00e081b05e05", "title":"Нож ротора (зам.501060)", "partid":"501063", "manufacturerid":"3a7e891f-07ba-11de-8a95-00e081b05e17", "manufacturer":"Geringhoff", "quantity":">10", "price":"932.27", "actionprice":"584.90", "rep":1, "img":1 } ,
{ "DT_RowId":"b7c6c9ee-adca-11e3-8202-00155d012119", "title":"Олива моторна "CASTROL VECTON" 10W40 E4"/E7", 208L"", "partid":"RB-V14E4E7-208L", "manufacturerid":"763d805e-c53b-11de-9210-00e081b05e05", "manufacturer":"CASTROL", "quantity":">10", "price":"111.60", "actionprice":"", "rep":1, "img":1 } ,
{ "DT_RowId":"05d8b41d-ec93-11dd-8f72-00e081b05e05", "title":"Н""о"ж"", "partid":"501251", "manufacturerid":"3a7e891f-07ba-11de-8a95-00e081b05e17", "manufacturer":"Geringhoff", "quantity":">10", "price":"719.45", "actionprice":"", "rep":1, "img":1 }
Please help. How can I remove or escape the double quotes between "title":" and ", "partid":?
You may use
(?:\G(?!\A)|"title":").*?\K"(?=.*?"\s*,\s*"partid":)
Replace each match with an empty string.
Details
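(?:\G(?!\A)|"title":") - either the end of the previous match (\G, but not at the start of the string) or the literal string "title":"
.*? - any 0+ chars other than linebreak chars, as few as possible
\K - a match reset operator that discards the text matched so far from the reported match
" - a " char
(?=.*?"\s*,\s*"partid":) - a positive lookahead requiring that, after any 0+ chars other than linebreak chars (as few as possible), there is a ", 0+ whitespace chars, a comma, 0+ whitespace chars, and "partid":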
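Python's built-in `re` module does not support `\G` or `\K`, so the PCRE pattern above cannot be used there as written. The sketch below swaps in a different technique: it captures the whole title value and strips the quotes inside it in a replacement callback. It assumes the title value is always followed by `", "partid":`, as in the sample data; the name `strip_inner_quotes` is hypothetical.

```python
import re

# Group 1: the "title":" prefix; group 2: the raw title value;
# group 3: the closing quote plus the following "partid": key.
TITLE = re.compile(r'("title":")(.*?)("\s*,\s*"partid":)')

def strip_inner_quotes(text):
    """Remove every double quote found inside a title value."""
    def fix(match):
        return match.group(1) + match.group(2).replace('"', '') + match.group(3)
    return TITLE.sub(fix, text)

sample = '{ "title":"Н""о"ж", "partid":"501251" }'
print(strip_inner_quotes(sample))
```

To escape the quotes instead of removing them, the callback could replace `"` with `\"` in group 2.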

replace every other space with new line

I have strings like this:
a <- "this string has an even number of words"
b <- "this string doesn't have an even number of words"
I want to replace every other space with a new line. So the output would look like this...
myfunc(a)
# "this string\nhas an\neven number\nof words"
myfunc(b)
# "this string\ndoesn't have\nan even\nnumber of\nwords"
I've accomplished this by doing a strsplit, paste-ing a newline on even numbered words, then paste(a, collapse=" ") them back together into one string. Is there a regular expression to use with gsub that can accomplish this?
@Jota suggested a simple and concise way:
myfunc = function(x) gsub("( \\S+) ", "\\1\n", x) # Jota's
myfunc2 = function(x) gsub("([^ ]+ [^ ]+) ", "\\1\n", x) # my idea
lapply(list(a,b), myfunc)
[[1]]
[1] "this string\nhas an\neven number\nof words"
[[2]]
[1] "this string\ndoesn't have\nan even\nnumber of\nwords"
How it works. The idea of "([^ ]+ [^ ]+) " regex is (1) "find two sequences of words/nonspaces with a space between them and a space after them" and (2) "replace the trailing space with a newline".
@Jota's "( \\S+) " is trickier -- it finds any word with a space before and after it and then replaces the trailing space with a newline. This works because the first word that is caught by this is the second word of the string; and the next word caught by it is not the third (since we have already "consumed"/looked at the space in front of the third word when handling the second word), but rather the fourth; and so on.
Oh, and some basic regex stuff.
[^xyz] means any single char except the chars x, y, and z.
\\s is any whitespace character, while \\S is any non-whitespace character
x+ means x one or more times
(x) "captures" x, allowing for reference in the replacement, like \\1
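The "consumed space" behavior described above is not specific to R. Here is a minimal Python sketch of the same substitution; the function name `every_other_space_to_newline` is an assumption made for the example.

```python
import re

def every_other_space_to_newline(text):
    # "( \S+) " consumes the space in front of a word, so successive
    # matches land on words 2, 4, 6, ... and only their trailing space changes.
    return re.sub(r"( \S+) ", r"\1\n", text)

a = "this string has an even number of words"
b = "this string doesn't have an even number of words"
print(repr(every_other_space_to_newline(a)))
print(repr(every_other_space_to_newline(b)))
```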

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to split the pattern in my gsub() call into alternative subpatterns with |, but the combined pattern does not match anything.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. Without that flag, the spaces around | are literal characters that must appear in the input, so nothing matches. In R, you can enable the flag with the inline (?x) modifier and the perl=TRUE option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=TRUE)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
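The effect of literal versus ignored whitespace can be reproduced in Python, whose `re.VERBOSE` flag plays the same role as `(?x)`. This is only a sketch; the helper name `strip_suffix` and the shortened ticker list are assumptions.

```python
import re

PATTERN = r"[.](.*)$ | [-](.*)$"
tickers = ["AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST"]

def strip_suffix(items, flags=0):
    """Apply the question's pattern to each item with the given regex flags."""
    return [re.sub(PATTERN, "", item, flags=flags) for item in items]

# Without VERBOSE the spaces around | are literal, so nothing matches.
print(strip_suffix(tickers))
# With VERBOSE the spaces are ignored and both alternatives work.
print(strip_suffix(tickers, flags=re.VERBOSE))
```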
However, you really do not need such a complex regex here.
If you plan to remove everything from the first hyphen or dot up to the end of the string, you may use the following regex:
[.-].*$
The character class [.-] matches the first . or - symbol, and .* matches all characters up to the end of the string ($).
For example:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
The regex below finds .ST or -XST at the end of the text and replaces it with an empty string (effectively removing that part). Don't forget that gsub returns the modified strings rather than modifying them in place. You won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
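The difference between the two readings of the task can be seen side by side in a short Python sketch. The function names `cut_from_first_separator` and `cut_known_suffix` are hypothetical; the regexes are the ones given in the answers.

```python
import re

def cut_from_first_separator(s):
    # Removes everything from the first "." or "-" to the end of the string.
    return re.sub(r"[.-].*$", "", s)

def cut_known_suffix(s):
    # Removes only a trailing ".ST" or "-XST".
    return re.sub(r"(\.ST|-XST)$", "", s)

for ticker in ["AAK.ST", "ALIV-SDB.ST", "AAC-XST", "AAD-XSTV"]:
    print(ticker, cut_from_first_separator(ticker), cut_known_suffix(ticker))
```

For "ALIV-SDB.ST" the first function returns "ALIV" while the second returns "ALIV-SDB", which is the distinction the last answer points out.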

Extract 2nd to last word in string

I know how to do it in Python, but can't get it to work in R
> string <- "this is a sentence"
> pattern <- "\b([\w]+)[\s]+([\w]+)[\W]*?$"
Error: '\w' is an unrecognized escape in character string starting "\b([\w"
> match <- regexec(pattern, string)
> words <- regmatches(string, match)
> words
[[1]]
character(0)
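In an R string literal each backslash must itself be escaped, so write \\w rather than \w. One way to get the second-to-last word:
sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', string)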
#[1] "a"
This reads: match lazily from the start until you reach the sequence of some word characters + some non-word characters + some word characters + optional non-word characters + end of string, then keep only the first group of word characters in that sequence.
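For comparison with the Python approach the question mentions, the same sequence can be searched for directly, which makes the leading `.*?` unnecessary. A minimal sketch, assuming the hypothetical helper name `second_to_last_word`:

```python
import re

def second_to_last_word(text):
    """Return the word before the last word, or None if there are fewer than two."""
    # word, separator(s), last word, optional trailing non-word chars, end of string
    match = re.search(r"(\w+)\W+\w+\W*$", text)
    return match.group(1) if match else None

print(second_to_last_word("this is a sentence"))
print(second_to_last_word("single"))
```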
Non-regex solution:
string <- "this is a sentence"
split <- strsplit(string, " ")[[1]]
split[length(split)-1]
Python non-regex version:
t = "this is a sentence"
spl = t.split(" ")
if len(spl) >= 2:  # need at least two words
    s = spl[-2]

How to validate a string to have only certain letters by perl and regex

I am looking for a Perl regex which will validate that a string contains only the letters ACGT. For example, "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY", which is not one of the letters A, C, G, T. I have the following code, but it accepts both of the strings above:
if($userinput =~/[A|C|G|T]/i)
{
    $validEntry = 1;
    print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
    $validEntry = 1;
    print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
    $validEntry = 1;
    print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset. Note that, unlike the /i regexes, this check is case-sensitive, and it also accepts an empty string.
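The counting idea behind tr/ACGT//c can be sketched in Python as follows. This is not a translation of Perl's tr operator, only an illustration of "count the characters outside the set and require zero"; the function names are assumptions.

```python
def count_outside(text, allowed="ACGT"):
    """Count the characters of text that are not in allowed."""
    return sum(1 for ch in text if ch not in allowed)

def is_valid(text):
    # Valid when no character falls outside the allowed set.
    return count_outside(text) == 0

print(count_outside("AAYYGGTTA"), is_valid("AAYYGGTTA"))
print(count_outside("AACGGGTTA"), is_valid("AACGGGTTA"))
```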
Your character class [A|C|G|T] contains |. Inside a character class, | does not stand for alternation; it only stands for itself. Therefore, the character class includes the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of those characters anywhere. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can also match just before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (it makes no difference whether you use \A or ^ in this case, but \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
    $validEntry = 1;
    print "Valid\n";
}
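The anchoring points above carry over to other regex engines. As a minimal Python sketch (the function name `is_dna` is an assumption), `re.fullmatch` requires the whole string to match, which sidesteps the trailing-newline behavior of `$`:

```python
import re

def is_dna(text):
    """Return True if text consists of one or more of A, C, G, T (any case)."""
    return re.fullmatch(r"[ACGT]+", text, flags=re.IGNORECASE) is not None

for candidate in ["AACGGGTTA", "AAYYGGTTA", "acgt", "ACGT\n", ""]:
    print(repr(candidate), is_dna(candidate))
```

Only "AACGGGTTA" and "acgt" pass; the string with a trailing newline is rejected, which a pattern ending in `$` would have accepted.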