regular expression with space - regex

I am using a regular expression in R with the following code:
> temp <- c("Herniorrhaphy, left inguinal", "Herniorrhaphy, right inguinal")
> grep("Herniorrhaphy, [left|right] inguinal",temp)
integer(0)
> grep("Herniorrhaphy, [left inguinal|right inguinal]",temp)
[1] 1 2
I wonder why the two regular expressions give different results. Thanks.

According to regexp explanation in the documentation (http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html):
Note that alternation does not work
inside character classes, where | has
its literal meaning.
That explains why the first alternative doesn't return any results: the '[' and ']' characters denote a character class. The correct syntax is:
grep("Herniorrhaphy, (left|right) inguinal",temp)
On my R, the second alternative also returns an empty set:
> temp <- c("Herniorrhaphy, left inguinal", "Herniorrhaphy, right inguinal")
> grep("Herniorrhaphy, [left inguinal|right inguinal] inguinal",temp)
integer(0)
>
Are you sure you are copying directly from the workspace?

I think you want parentheses ( ), not a character class [ ], i.e.
"Herniorrhaphy, (left|right) inguinal"
"Herniorrhaphy, (left inguinal|right inguinal)"
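A minimal sketch of the difference, written with Python's re module rather than R (an assumption made for illustration; the helper matching_indices and the variable names are hypothetical). For these particular patterns a bracket expression behaves the same way in both. Like grep, re.search looks for a match anywhere in the string, which is why the question's second pattern appears to work: the class matches the single character l or r after the comma and space, and nothing else is required after it.

```python
import re

temp = ["Herniorrhaphy, left inguinal", "Herniorrhaphy, right inguinal"]

def matching_indices(pattern, strings):
    """Return 1-based indices of the strings containing a match, like R's grep()."""
    return [i for i, s in enumerate(strings, start=1) if re.search(pattern, s)]

# Character class: matches exactly ONE character out of l, e, f, t, |, r, i, g, h.
class_hits = matching_indices(r"Herniorrhaphy, [left|right] inguinal", temp)

# Still a single character, but nothing must follow it, so a prefix of each string matches.
loose_hits = matching_indices(r"Herniorrhaphy, [left inguinal|right inguinal]", temp)

# Grouping parentheses: a real alternation between the two words.
group_hits = matching_indices(r"Herniorrhaphy, (left|right) inguinal", temp)

print(class_hits, loose_hits, group_hits)  # [] [1, 2] [1, 2]
```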

Related

Removing everything between nested parentheses

To remove everything between parentheses, I currently use:
SELECT
REGEXP_REPLACE('(aaa) bbb (ccc (ddd) / eee)', "\\([^()]*\\)", "");
This is incorrect: it gives bbb (ccc / eee), because it removes only the innermost parentheses.
How can I remove everything between nested parentheses? The expected result for this example is bbb.
In Google BigQuery, this is only possible if you know the maximum nesting depth, because BigQuery uses the re2 library, which doesn't support regex recursion. The following JavaScript snippet illustrates a pattern that handles up to two levels of nesting:
let r = /\((?:(?:\((?:[^()])*\))|(?:[^()]))*\)/g
let s = "(aaa) bbb (ccc (ddd) / eee)"
console.log(s.replace(r, ""))
If you can iterate on the regular expression operation until you reach a fixed point you can do it like this:
repeat {
  old_string := string
  string := remove_non_nested_parens_using_regex(string)
} until (string == old_string)
For instance if we have
((a(b)) (c))x)
On the first iteration we remove (b) and (c): sequences which begin with (, end with ), and do not contain parentheses, matched by \([^()]*\). We end up with:
((a) )x)
Then on the next iteration, (a) is gone:
( )x)
and after one more iteration, ( ) is gone:
x)
When we try removing more parentheses, nothing changes, and so the algorithm terminates with x).
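The fixed-point loop above can be sketched in Python as follows. This is only an illustration under the assumption that the regex can be applied repeatedly in a host language; the function name strip_nested_parens is hypothetical.

```python
import re

# Matches one innermost group: "(", then no parentheses, then ")".
INNERMOST = re.compile(r"\([^()]*\)")

def strip_nested_parens(text):
    """Remove innermost parenthesized groups until the string stops changing."""
    while True:
        stripped = INNERMOST.sub("", text)
        if stripped == text:      # fixed point reached
            return text
        text = stripped

print(repr(strip_nested_parens("((a(b)) (c))x)")))               # 'x)'
print(repr(strip_nested_parens("(aaa) bbb (ccc (ddd) / eee)")))  # ' bbb '
```

Unbalanced parentheses, like the trailing ) in the first example, are left in place, and surrounding whitespace is not trimmed.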

Remove substrings between < and > (including the brackets) with no angle brackets inside

I have to modify an HTML-like text with the sed command. I need to delete substrings that start with one or more < characters, followed by zero or more characters other than angle brackets, followed by one or more > characters.
For example: from
aaa<bbb>ccc I would like to get aaaccc
I am able to do this with
"s/<[^>]\+>//g"
but this command doesn't work if the string between < and > is empty, or if there are doubled brackets such as <<>> in the text.
For example, from
aa<>bb<cc>vv<<gg>>h
I get
aa<>bbvv>h
instead of
aabbvvh
How can I modify it to give me the right result?
The problem is that once you allow the < and > characters to nest, the language changes from "regular" to "context-free".
Regular languages are those matched by regular expressions, while context-free languages cannot, in general, be parsed by a regular expression. The unbounded level of nesting is what prevents this: parsing such languages requires a stack-based (pushdown) automaton.
But there is a slightly complicated workaround: if you assume an upper limit on the nesting depth in the text you are facing, you can treat a non-regular language as regular, on the premise that the non-regular cases will never occur.
Suppose you will never have more than three levels of nesting in your pattern (this lets you see the pattern and extend it to N levels). You can use the following algorithm to build a regular expression that matches up to three levels of nesting, but no more (you can make a regexp that parses N levels, but no more; this is the unbounded-bounded nature of regexps :) ).
Let's construct the expression recursively from the bottom up. At level 0 there is no nesting at all, so neither < nor > may appear (if you allowed <, you would allow more nesting levels, which is forbidden at level 0):
{l0} = [^<>]*
a string including no < and > characters.
Your matching text will be of this class of strings, surrounded by a pair of < and > chars:
{l1} = <[^<>]*>
Now you can build a second level of nesting by alternating {l0}{l1}{l0}{l1}...{l0} (that is, {l0}({l1}{l0})*) and surrounding the whole thing with < and > to build {l2}:
{l2} = <{l0}({l1}{l0})*> = <[^<>]*(<[^<>]*>[^<>]*)*>
Now you can build a third level by alternating sequences of {l0} and {l2} inside a pair of brackets (remember that {li} represents a regexp that allows up to i levels of nesting):
{l3} = <{l0}({l2}{l0})*> = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
and so on, successively, you form a sequence of
{lN} = <{l0}({l(N-1)}{l0})*>
and stop when you consider there will not be a deeper nesting in your input file.
So your level three regexp is:
<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l3--------------------------------------}
<{l0--}({l2---------------------}{l0--})*>
        <{l0--}({l1----}{l0--})*>
                <{l0--}>
You can see that the regexp grows as you consider more levels. The good thing is that you can settle on a maximum level of three or four and most text will fit in this category.
See demo.
NOTE
Never hesitate to build a regular expression, even if it appears somewhat complex. Remember that you can build it inside your program, using the techniques I've used here (e.g. for a 16-level nesting regexp you'll get a large string that is very difficult to write by hand but very easy to build with a computer):
package com.stackoverflow.q61630608;
import java.util.regex.Pattern;
public class NestingRegex {
    public static String build_regexp( char left, char right, int level ) {
        return level == 0
            ? "[^" + left + right + "]*"
            : level == 1
                ? left + build_regexp( left, right, 0 ) + right
                : left + build_regexp( left, right, 0 )
                    + "(" + build_regexp( left, right, level - 1 )
                    + build_regexp( left, right, 0 )
                    + ")*" + right;
    }
    public static void main( String[] args ) {
        for ( int i = 0; i < 5; i++ )
            System.out.println( "{l" + i + "} = "
                + build_regexp( '<', '>', i ) );
        Pattern pat = Pattern.compile( build_regexp( '<', '>', 16 ), 0 );
        String s = "aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp";
        System.out.println(
            String.format( "pat.matcher(\"%s\").replaceAll(\"#\") => %s",
                s, pat.matcher( s ).replaceAll( "#" ) ) );
    }
}
which, when run, gives:
{l0} = [^<>]*
{l1} = <[^<>]*>
{l2} = <[^<>]*(<[^<>]*>[^<>]*)*>
{l3} = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l4} = <[^<>]*(<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>[^<>]*)*>
pat.matcher("aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp").replaceAll("#") => aa#bb#vv#h#ppp
The main advantage of using regular expressions is that once you have written one, it compiles into an internal representation that, at least in automaton-based engines, only has to visit each character of the string being matched once, leading to very efficient matching code (you probably won't get code that efficient by writing it yourself). Backtracking engines such as java.util.regex may do more work than that.
Sed
For sed, you only need to generate a sufficiently deep regexp and use it to parse your text file:
sed 's/<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>//g' file1.xml
will give you appropriate results (this handles six levels of nesting or fewer; remember that ( and ) must be escaped to be treated as group delimiters in sed).
Your regexp can be constructed using shell variables with the following approach:
l0="[^<>]*"
l1="<${l0}>"
l2="<${l0}\(${l1}${l0}\)*>"
l3="<${l0}\(${l2}${l0}\)*>"
l4="<${l0}\(${l3}${l0}\)*>"
l5="<${l0}\(${l4}${l0}\)*>"
l6="<${l0}\(${l5}${l0}\)*>"
echo regexp is "${l6}"
regexp is <[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>
sed -e "s/${l6}/#/g" <<EOF
aa<>bb<cc>vv<<gg>>h<iii<jj<>j>k<k>k<<lll>mmm>ooo>ppp
EOF
aa#bb#vv#h#ppp
(I've used # as the substitution string so you can see where in the input string the patterns were detected.)
You may use
sed 's/<\+[^>]*>\+//g'
sed 's/<\{1,\}[^>]*>\{1,\}//g'
sed -E 's/<+[^>]*>+//g'
The patterns match
<\+ / <\{1,\} - 1 or more occurrences of < char
[^>]* - negated bracket expression that matches 0 or more chars other than >
>\+ / >\{1,\} - 1 or more occurrences of > char
Note that in the last, POSIX ERE, example, + that is unescaped is a quantifier matching 1 or more occurrences, same as \+ in the POSIX BRE pattern.
See the online sed demo:
s='aa<>bb<cc>vv<<gg>>h'
sed 's/<\+[^>]*>\+//g' <<< "$s"
sed 's/<\{1,\}[^>]*>\{1,\}//g' <<< "$s"
sed -E 's/<+[^>]*>+//g' <<< "$s"
Result of each sed command is aabbvvh.
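As a rough cross-check, the same ERE pattern can be tried in Python (an assumption for illustration; the sed commands above are the actual answer, and the variable names are hypothetical). The second call shows where this simpler pattern stops being enough: with genuinely nested brackets, [^>]* runs up to the first > and leaves the tail behind, which is the case the bounded-nesting construction addresses.

```python
import re

# Same pattern as: sed -E 's/<+[^>]*>+//g'
SIMPLE = re.compile(r"<+[^>]*>+")

flat = SIMPLE.sub("", "aa<>bb<cc>vv<<gg>>h")
nested = SIMPLE.sub("", "h<iii<jjj>kkk>ppp")

print(flat)    # aabbvvh
print(nested)  # hkkk>ppp  (the match ends at the first ">")
```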

Subdivide an expression into alternative subpattern - using gsub()

I'm trying to subdivide my metacharacter expression in my gsub() function. But it does not return anything found.
Task: I want to delete all sections of string that contain either .ST or -XST in my vector of strings.
As you can see below, using one expression works fine. But the | expression simply does not work. I'm following the metacharacter guide on https://www.stat.auckland.ac.nz/~paul/ItDT/HTML/node84.html
What can be the issue? And what caused this issue?
My data
> rownames(table.summary)[1:10]
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$ | [-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV-SDB.ST" "AOI.ST" "ATCO-A.ST" "ATCO-B.ST" "AXFO.ST" "AXIS.ST" "AZN.ST"
> gsub(pattern = '[.](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK" "ABB" "ALFA" "ALIV-SDB" "AOI" "ATCO-A" "ATCO-B" "AXFO" "AXIS" "AZN"
> gsub(pattern = '[-](.*)$', replacement = "", x = rownames(table.summary)[1:10])
[1] "AAK.ST" "ABB.ST" "ALFA.ST" "ALIV" "AOI.ST" "ATCO" "ATCO" "AXFO.ST" "AXIS.ST" "AZN.ST"
It seems you tested your regex with a flag like IgnorePatternWhitespace (VERBOSE, /x) that allows whitespace inside patterns for readability. You can use it with the perl=T option:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub('(?x)[.](.*)$ | [-](.*)$', '', d, perl=T)
## [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
However, you really do not have to use that complex regex here.
If you plan to remove all substrings from the first hyphen or dot up to the end, you may use the following regex:
[.-].*$
The character class [.-] will match the first . or - symbol and .* will match all characters up to the end of the string ($).
See IDEONE demo:
d <- c("AAK.ST","ABB.ST","ALFA.ST","ALIV-SDB.ST","AOI.ST","ATCO-A.ST","ATCO-B.ST","AXFO.ST", "AXIS.ST","AZN.ST")
gsub("[.-].*$", "", d)
Result: [1] "AAK" "ABB" "ALFA" "ALIV" "AOI" "ATCO" "ATCO" "AXFO" "AXIS" "AZN"
This will find .ST or -XST at the end of the text and replace it with an empty string (effectively removing that part). Don't forget that gsub returns the modified string; it does not modify it in place. You won't see any change until you assign the return value back to a variable.
strings <- c("AAK.ST", "ABB.ST", "ALFA.ST", "ALIV-SDB.ST", "AOI.ST", "ATCO-A.ST", "ATCO-B.ST", "AXFO.ST", "AXIS.ST", "AZN.ST", "AAC-XST", "AAD-XSTV")
strings <- gsub('(\\.ST|-XST)$', '', strings)
Your regular expression ('[.](.*)$ | [-](.*)$'), if not for the unnecessary spaces, would remove everything from the first dot (.) or dash (-) to the end of the text. This might be what you want, but it is not what you said you want.
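The effect of the literal spaces can be reproduced in Python, whose re.VERBOSE flag plays the role of (?x) (an illustrative sketch, not the R code from the answers; the sample vector and variable names are hypothetical). It also contrasts the "from the first dot or dash" pattern with the "suffix only" pattern.

```python
import re

names = ["AAK.ST", "ALIV-SDB.ST", "ATCO-A.ST", "AAC-XST", "AAD-XSTV"]
spaced = r"[.](.*)$ | [-](.*)$"

# Without a verbose flag the spaces must match literally, so nothing is removed.
literal = [re.sub(spaced, "", n) for n in names]

# With re.VERBOSE, whitespace in the pattern is ignored and the alternation works.
verbose = [re.sub(spaced, "", n, flags=re.VERBOSE) for n in names]

# Equivalent simpler pattern: cut from the first dot or hyphen to the end.
from_first = [re.sub(r"[.-].*$", "", n) for n in names]

# Only strip a trailing .ST or -XST, as the task was worded.
suffix_only = [re.sub(r"(\.ST|-XST)$", "", n) for n in names]

print(literal)      # unchanged
print(verbose)      # ['AAK', 'ALIV', 'ATCO', 'AAC', 'AAD']
print(from_first)   # ['AAK', 'ALIV', 'ATCO', 'AAC', 'AAD']
print(suffix_only)  # ['AAK', 'ALIV-SDB', 'ATCO-A', 'AAC', 'AAD-XSTV']
```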

Replace a random block of characters in a string in R

I have a text and I want to replace a text block in a line, like this:
"\t\t\tFGHGFJKJKJKGDSJS"
with
x= "ABCCCBBHHJJJH"
I'm interested in changing just the text block (FGHGFJKJKJKGDSJS) without modifying the other special characters, so as to obtain:
"\t\t\tABCCCBBHHJJJH"
Is there a way to replace FGHGFJKJKJKGDSJS without explicitly specifying the exact combination of letters?
I found a solution this way: txt[line number] = paste0("\t", "\t", "\t", x)
But I would like to know whether there is a more general solution.
> library(stringr)
> mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
> x <- "ABCCCBBHHJJJH"
> str_replace(mystring,"\\w+",x)
[1] "\t\t\tABCCCBBHHJJJH"
\w+ means: match any letter, digit, or underscore at least once and as many times as possible. So the first run of such word characters is replaced by your x variable, and the leading tabs are left untouched.
> a = "\t\t\tDFGGD"
> gsub("(\t\t\t).*","\\1ABCDF",a)
[1] "\t\t\tABCDF"
mystring <- "\t\t\tFGHGFJKJKJKGDSJS"
x <- "ABCCCBBHHJJJH"
sub('\\w+',x,mystring,ignore.case=T)
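For comparison, a minimal Python sketch of the same idea (an illustration only; the answers above are the R solutions, and the variable names simply mirror them). Tabs are not word characters, so \w+ skips them and matches the first run of letters.

```python
import re

mystring = "\t\t\tFGHGFJKJKJKGDSJS"
x = "ABCCCBBHHJJJH"

# count=1 replaces only the first run of word characters, like R's sub().
# Passing a function as the replacement keeps x from being parsed for backslash escapes.
result = re.sub(r"\w+", lambda match: x, mystring, count=1)

print(repr(result))  # '\t\t\tABCCCBBHHJJJH'
```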

Delphi RegEx Parenthesis Parser

I am looking for a regular expression as a general solution.
It is used to extract parenthesized functions and their parameters.
Input:
...alotOfText...
DBINFO("Parameter1"|'FirstFunction(Parameter)'|Parameter3|SecondFunction("Parameter1"|Parameter2)")
...alotOfOtherText...
Current regex:
cRegex =
'DBINFO\('// Looking for DBINFO(
+ '(?:' // Recursion for following Pattern(s)
+ '[^\)]' // any character except ")"
+ '|(?R))' // or Repeat the Recursion (am i right?) I don't really understand this line
+ '*\)' // Quantifier for recursion (?) with unlimited Chars and one ")" at the end.
;
For inputs with only one set of () this works, but as soon as I need to parse the input mentioned above, the match only extends up to the first occurrence of a ).
My research suggests that multiple levels of parentheses require subroutines. But even in my primary information source I can't find an example that gets me back on track: http://www.regular-expressions.info/subroutine.html
Remarks:
Each parameter could be blank, with " or with ' (mixed)
Source:
hRegEx := TRegEx.Create(cRegex, [roIgnoreCase, roMultiLine]);
hMatchCollection := hRegEx.Matches(aLayoutString);
for hMatch in hMatchCollection do
// Regarding the Regular Expression there should only be one Match in the Collection.
//Thats Subject to Change
begin
if hMatch.Success then
begin
Result := ParseParameter(hMatch.Value);
end;
end;
If you give an example, please comment it as I did mine. I want to believe .. ah, learn. :)
Found!
cRegex =
'DBINFO' // some Searchinfo outside the parenthesis Expression
+ '(' // Outer Match Start for (?1)
+ '\(' // Search one "("
+ '(' // "SubGroup" Start
+ '(?>[^()]+)' // SubPattern: everything that is non-parentheses
+ '|(?1)' // or recursive match of the Subpattern 1
+ ')' // "SubGroup" End
+ '*\)' // any number of "SubGroup" and one ")"
+ ')' // Outer Match End
;
I was wrong with my first expression. The parenthesis expression itself was perfectly fine. So this seems to work fine.
Found at:
http://mushclient.com/pcre/pcrepattern.html#SEC19
If someone with more knowledge could correct my comments about the expression, please do. First, I am using the wrong names. Second, I am not sure whether (?1) refers to the inner () or the outer () match. And I don't know how to format expressions.
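Where the regex engine has no recursion support (Python's built-in re module, for example, does not implement (?R) or (?1)), the same extraction can be approximated without a regex by counting parenthesis depth. The sketch below is a different technique from the PCRE pattern above, not a translation of it. The function name extract_call is hypothetical, and it assumes parentheses inside quoted parameters are balanced, as they are in the sample input.

```python
def extract_call(text, name="DBINFO"):
    """Return the first 'name(...)' call with balanced parentheses, or None."""
    start = text.find(name + "(")
    if start == -1:
        return None
    depth = 0
    # Scan from the opening parenthesis and track the nesting depth.
    for i in range(start + len(name), len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:          # the matching close of the outer "("
                return text[start:i + 1]
    return None                      # unbalanced input


sample = 'text DBINFO("P1"|\'First(Parameter)\'|P3|Second("P1"|P2)) more text'
print(extract_call(sample))
```

Handling quotes that contain unbalanced parentheses would need extra state for the quote characters, which this sketch deliberately leaves out.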