Matching regular expressions - regex

I have a regular expression that I'm using to update log4j syntax to log4j2 syntax, removing the string replacement. The regular expression is as follows:
(?:^\(\s*|\s*\+\s*|,\s*)(?:[\w\(\)\.\d+]*|\([\w\(\)\.\d+]*\s*(?:\+|-)\s*[\w\(\)\.\d+]*\))(?:\s\+\s*|\s*\);)
This will successfully match the variables in the following strings
("Unable to retrieve things associated with this='" + thingId + "' in " + (endTime - startTime) + " ms");
("Persisting " + things.size() + " new or updated thing(s)");
("Count in use for thing=" + secondThingId + " is " + countInUse);
("Unable to check thing state '" + otherThingId + "' using '" + address + "'", e);
But not '+ thingCollection.get(0).getMyId()' in
("Exception occured while updating thingId="+ thingCollection.get(0).getMyId(), e);
I am getting better with regular expressions, but this one has me a bit stumped. Thanks!

For some reason, when some people are writing a regex pattern, they forget that the whole of the Perl language is still available.
I would just delete all the strings and find the remaining substrings that look like variable names:
use strict;
use warnings 'all';
use feature qw/ say fc /;
use List::Util 'uniq';
my @variables;
while ( <DATA> ) {
    s/"[^"]*"//g;
    push @variables, /\b[a-z]\w*/ig;
}
say for sort { fc $a cmp fc $b } uniq @variables;
__DATA__
("Unable to retrieve things associated with this='" + thingId + "' in " + (endTime - startTime) + " ms");
("Persisting " + things.size() + " new or updated thing(s)");
("Count in use for thing=" + secondThingId + " is " + countInUse);
("Unable to check thing state '" + otherThingId + "' using '" + address + "'", e);
("Exception occured while updating thingId="+ thingCollection.get(0).getMyId(), e);
output
address
countInUse
e
endTime
get
getMyId
otherThingId
secondThingId
size
startTime
thingCollection
thingId
things
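The same idea (strip the string literals, then collect whatever looks like an identifier) can be sketched in JavaScript. This is only an illustration: the function name extractIdentifiers and the sample array are made up for this example, and like the Perl version it does not handle escaped quotes inside strings.

```javascript
// Hypothetical helper: collect identifier-like tokens left after removing string literals.
function extractIdentifiers(lines) {
  const found = new Set();
  for (const line of lines) {
    // Drop double-quoted string literals (no escaped-quote handling, as in the Perl version).
    const stripped = line.replace(/"[^"]*"/g, "");
    for (const m of stripped.matchAll(/\b[a-z]\w*/gi)) {
      found.add(m[0]);
    }
  }
  // Case-insensitive sort, mirroring the fc-based sort in the Perl script.
  return [...found].sort((a, b) => {
    const x = a.toLowerCase();
    const y = b.toLowerCase();
    return x < y ? -1 : x > y ? 1 : 0;
  });
}

const sampleLogLines = [
  '("Persisting " + things.size() + " new or updated thing(s)");',
  '("Exception occured while updating thingId="+ thingCollection.get(0).getMyId(), e);',
];
console.log(extractIdentifiers(sampleLogLines));
```

As in the Perl output, method names such as get and size are reported alongside the variables, so some filtering is still needed afterwards.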

You should be able to simplify your regex to match things in between '+' signs.
(?:\+)([^"]*?)(?:[\+,])
(Note the ? after the *: it makes the * lazy, so it matches as little as possible and catches every occurrence.)
If you want just the variable, you can read the first capture group from that expression; ignore the capture group to get the full match.
Updated version:
(?:\+)([^"]*?)(?:[\+,])|\s([^"+]*?)\);
Note that with the updated version the variable might be placed in capture group 2 instead of 1.
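As a rough illustration of reading the first capture group, here is a JavaScript sketch that applies the first pattern, (?:\+)([^"]*?)(?:[\+,]), to two of the sample lines. The function name variablesBetweenPluses is hypothetical, and the trim() call is an addition to drop the surrounding spaces.

```javascript
// First pattern from the answer, with the g flag so matchAll can iterate over every match.
const betweenPluses = /(?:\+)([^"]*?)(?:[\+,])/g;

// Hypothetical helper: return the trimmed contents of capture group 1 for each match.
function variablesBetweenPluses(line) {
  return [...line.matchAll(betweenPluses)].map((m) => m[1].trim());
}

console.log(
  variablesBetweenPluses(
    '("Unable to check thing state \'" + otherThingId + "\' using \'" + address + "\'", e);'
  )
);
console.log(
  variablesBetweenPluses(
    '("Exception occured while updating thingId="+ thingCollection.get(0).getMyId(), e);'
  )
);
```

Because each match consumes its closing + sign, two variables concatenated directly (a + b + c with no string in between) would not all be found; the sample lines do not contain that case.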

You might be able to pare it down to this:
(?:^\(\s*|\s*\+\s*|,\s*)(?:[\w().\s+]+|\([\w().\s+-]*\))(?:(?=,)|\s*\+\s*|\s*\);)
It consolidates some constructs.
To fix the immediate problem, I added a comma in some classes.
Note that this kind of regex is fraught with a problematic type of flow.
(?:
     ^ \( \s*
  |  \s* \+ \s*
  |  , \s*
)
(?:
     [\w().\s+]+
  |  \( [\w().\s+-]* \)
)
(?:
     (?= , )
  |  \s* \+ \s*
  |  \s* \);
)

Related

Match every thing between "****" or [****]

I have a regex that looks like this:
(?="(test)"\s*:\s*(".*?"|\[.*?]))
to match the value between "..." or [...]
Input
"test":"value0"
"test":["value1", "value2"]
Output
Group1 Group2
test value0
test "value1", "value2" // or - value1", "value2
Is there any trick to ignore "" and [] and stick with two groups, group1 and group2?
I tried (?="(test)"\s*:\s*(?="(.*?)"|\[(.*?)])) but this gives me 4 groups, which is not good for me.
You may use this regex in PHP, which relies on a branch reset group:
"(test)"\h*:\h*(?|"([^"]*)"|\[([^]]*)])
This will give you 2 capture groups in both the inputs with enclosing " or [...].
RegEx Details:
(?|..) is a branch reset group. Subpatterns declared within each alternative of this construct start over from the same index.
(?|"([^"]*)"|\[([^]]*)]) is an alternation inside a branch reset group (not an if-then-else conditional): if " is matched, the "([^"]*)" alternative is used; otherwise the \[([^]]*)] alternative is tried. Either way the value lands in Group 2.
You can use a pattern like
"(test)"\s*:\s*\K(?|"\K([^"]*)|\[\K([^]]*))
See the regex demo.
Details:
" - a " char
(test) - Group 1: test word
" - a " char
\s*:\s* - a colon enclosed with zero or more whitespaces
\K - match reset operator that clears the current overall match memory buffer (group value is still kept intact)
(?|"\K([^"]*)|\[\K([^]]*)) - a branch reset group:
"\K([^"]*) - matches a ", then discards it, and then captures into Group 2 zero or more chars other than "
| - or
\[\K([^]]*) - matches a [, then discards it, and then captures into Group 2 zero or more chars other than ]
In Java, you can't use \K and ?|, use capturing groups:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "\"test\":[\"value1\", \"value2\"]";
        // Group 2 captures a quoted value, group 3 a bracketed one
        Pattern pattern = Pattern.compile("\"(test)\"\\s*:\\s*(?:\"([^\"]*)|\\[([^\\]]*))");
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            System.out.println("Key: " + matcher.group(1));
            if (matcher.group(2) != null) {
                System.out.println("Value: " + matcher.group(2));
            } else {
                System.out.println("Value: " + matcher.group(3));
            }
        }
    }
}
See a Java demo.

RegEx to match with single occurrence of dash anywhere in [A-Z0-9]+ with total occurrence of 20 chars

I couldn't figure out a regex that matches a single occurrence of a dash anywhere in [A-Z0-9]+, with a maximum total length of 20 characters (the - and the [A-Z0-9]+ parts together).
This is the closest pattern I could get, but it didn't work:
([A-Z0-9]{1,19}|\-{1})
Why use a regex, especially a single regex? These conditions are much easier to check separately.
For example, using Perl:
if (length($str) <= 20 && $str =~ /\A[A-Z0-9]*-[A-Z0-9]*\z/)
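The same "check the conditions separately" idea can be sketched in JavaScript. The function name isValidCode is hypothetical. Without the m flag, $ in JavaScript matches only at the very end of the input, so it plays the role of Perl's \z here.

```javascript
// Hypothetical helper: test the length first, then the shape (exactly one dash).
function isValidCode(str) {
  return str.length <= 20 && /^[A-Z0-9]*-[A-Z0-9]*$/.test(str);
}

for (const candidate of ["A-A", "-", "A", "A-B-C", "AAAAAAAAAAA-AAAAAAAAA"]) {
  console.log(candidate + " -> " + isValidCode(candidate));
}
```

Keeping the two conditions apart makes it easy to change the length limit without touching the pattern.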
Another option is to use a positive lookahead that asserts the length is 1 to 20 chars:
^(?=.{1,20}$)[A-Z0-9]*-[A-Z0-9]*$
Depending on the tool or language, you may want to use anchors other than ^ and $ to match the start and end of the string or line.
For example:
let pattern = /^(?=.{1,20}$)[A-Z0-9]*-[A-Z0-9]*$/;
[
  "AAAAAAAAAA-AAAAAAAAA",
  "-",
  "A-A",
  "-A",
  "A-",
  "A",
  "AAAAAAAAAAA-AAAAAAAAA",
  "AAAAAAAAAAAAAAAAAAAA",
].forEach(s => {
  if (pattern.test(s)) {
    console.log("Match: '" + s + "' (Nr of chars: " + s.length + ")");
  } else {
    console.log("No match: '" + s + "' (Nr of chars: " + s.length + ")");
  }
});

Autohotkey RegExReplace Skip unmatched pattern

How can I skip an unmatched line in the input when replacing by regex?
For example, below are the contents of my test.txt:
elkay_iyer#yahoo.com
elkay_qwer#yahoo.com
elke engineering ltd.,#yahoo.com
elke0265#yahoo.com
elke#yahoo.com
Below is my Autohotkey script with regex code
ReplaceEmailsRegEx := "i)([a-z0-9]+(\.*|\_*|\-*))+#([a-z][a-z0-9\-]+(\.|\-*\.))+[a-z]{2,6}"
RemoveDuplicateCharactersRegEx := "s)(.)(?=.*\1)"
Try{
FileRead, EmailFromTxtFile, test.txt
OtherThanEmails :=RegExReplace(EmailFromTxtFile,ReplaceEmailsRegEx)
Chars :=RegExReplace(OtherThanEmails,RemoveDuplicateCharactersRegEx)
Loop{
StringReplace, OtherThanEmails, OtherThanEmails, `r`n`r`n,`r`n, UseErrorLevel
If ErrorLevel = 0
Break
}
If (StrLen(OtherThanEmails)){
Msgbox The Characters found other than email:`n%OtherThanEmails%
}
}
catch e {
ErrorString:="what: " . e.what . "file: " . e.file . " line: " . e.line . " msg: " . e.message . " extra: " . e.extra
Msgbox An Exception was thrown`n%ErrorString%
}
Return
When it runs the replacement on test.txt, it throws an error:
e.what contains 'RegExReplace', e.line is 10
It executes without error when I remove the 3rd email in test.txt. So how can I change my regex to skip the problematic string?
The problem you have is catastrophic backtracking due to the nested quantifier in the beginning: ([a-z0-9]+(\.*|\_*|\-*))+. Here, the ., _ and - are all optional due to the * quantifier and thus your pattern gets reduced to ([a-z0-9]+)+.
I suggest "unrolling" the first subpattern to make it linear:
i)[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*#([a-z][-a-z0-9]+\.)+[a-z]{2,6}
Or
i)[a-z0-9]+(?:([._-])\1*[a-z0-9]+)*#(?:[a-z][-a-z0-9]+\.)+[a-z]{2,6}
You may even remove \1* if you do not allow more than 1 . or _ or - in between "words".
Also, there is no need to use \-* with alternation in (\.|\-*\.), as the hyphen is already matched by the preceding character class; this subpattern can therefore be reduced to \..
See the regex demo
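To see the unrolled pattern at work outside AutoHotkey, here is a small JavaScript sketch. It uses the first unrolled pattern with the # separator exactly as it appears in the sample data; the helper name leftoverLines is hypothetical, and the domain group is made non-capturing here since its capture is not used.

```javascript
// Unrolled pattern from the answer; "#" is the separator used in the sample data.
const emailLike = /[a-z0-9]+(?:(?:\.+|_+|-+)[a-z0-9]+)*#(?:[a-z][-a-z0-9]+\.)+[a-z]{2,6}/gi;

// Hypothetical helper: remove every match and return the non-empty leftover lines.
function leftoverLines(text) {
  return text
    .replace(emailLike, "")
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "");
}

const sampleText = [
  "elkay_iyer#yahoo.com",
  "elkay_qwer#yahoo.com",
  "elke engineering ltd.,#yahoo.com",
  "elke0265#yahoo.com",
  "elke#yahoo.com",
].join("\r\n");

console.log(leftoverLines(sampleText));
```

The third line is left untouched because the character right before # is a comma, so no alphanumeric "word" can end there; the pattern simply fails on that line instead of backtracking through many combinations.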

Reg Exp matching multiple instances

I have to match multiple instances of either "int(" or "der("
So the expression must match these strings
VVEH + int(ACC_X) + der(FL_WSP)
VVEH + int(ACC_X) + int(FL_WSP)
VVEH + der(ACC_X) + der(FL_WSP)
and not these
VVEH + int(ACC_X) + log(FL_WSP)
VVEH + der(ACC_X) + log(FL_WSP)
VVEH( \+ (int|der)\([^)]+\)){2,}
VVEH #Initial string
(
\+ #Escape the 'plus'
(int|der) #Either of your function names
\( #Escape the bracket
[^)]+ #Match anything inside the brackets
\) #Escape the bracket
){2,} #All of that stuff above at least twice
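A quick way to try the pattern against the sample strings is sketched below in JavaScript. The ^ and $ anchors are an added assumption (the pattern above has none), and the function name hasOnlyIntOrDerTerms is hypothetical.

```javascript
// Anchored variant of the pattern above; the anchors are an assumption, not part of the answer.
const intOrDerTerms = /^VVEH( \+ (int|der)\([^)]+\)){2,}$/;

// Hypothetical helper: true when every term after VVEH is int(...) or der(...).
function hasOnlyIntOrDerTerms(expr) {
  return intOrDerTerms.test(expr);
}

const expressions = [
  "VVEH + int(ACC_X) + der(FL_WSP)",
  "VVEH + int(ACC_X) + int(FL_WSP)",
  "VVEH + der(ACC_X) + der(FL_WSP)",
  "VVEH + int(ACC_X) + log(FL_WSP)",
  "VVEH + der(ACC_X) + log(FL_WSP)",
];
for (const expr of expressions) {
  console.log(expr + " -> " + hasOnlyIntOrDerTerms(expr));
}
```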

Matching math expression with regular expression?

For example, these are valid math expressions:
a * b + c
-a * (b / 1.50)
(apple + (-0.5)) * (boy - 1)
And these are invalid math expressions:
--a *+ b # 1.5.0 // two consecutive signs, two consecutive operators, invalid operator, invalid number
-a * b + 1) // unmatched parentheses
a) * (b + c) / (d // unmatched parentheses
I have no problem with matching float numbers, but have difficulty with parentheses matching. Any idea? If there is better solution than regular expression, I'll accept as well. But regex is preferred.
========
Edit:
I want to make some comments on my choice of the “accepted answer”, hoping that people who have the same question and find this thread will not be misled.
There are several answers I consider “accepted”, but I have no idea which one is the best, so I chose the accepted answer (almost) at random. I recommend reading Guillaume Malartre’s answer in addition to the accepted answer. All of them give practical solutions to my question. For a somewhat more rigorous/theoretical answer, please read David Thornley’s comments under the accepted answer. As he mentioned, Perl’s extensions to regular expressions (which originated from regular languages) make them “irregular”. (I mentioned no language in my question, so most answerers assumed the Perl implementation of regular expressions, probably the most popular implementation. So did I when I posted my question.)
Please correct me if I said something wrong above.
Use a pushdown automaton for matching parentheses: http://en.wikipedia.org/wiki/Pushdown_automaton (or just a stack ;-) )
Details for the stack solution:
while (chr available)
    if chr == '(' then
        push '('
    else if chr == ')' then
        if stack.elements == 0 then
            print('too many or misplaced )')
            exit
        else
            pop // from stack
        end if
    end if
end while
if (stack.elements != 0)
    print('too many or misplaced (')
Even simpler: just keep a counter instead of a stack.
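The counter variant could look roughly like this in JavaScript. It is a sketch that checks only the parentheses, nothing else about the expression, and the function name parensBalanced is hypothetical.

```javascript
// Hypothetical helper: true when every ")" closes an earlier "(" and none are left open.
function parensBalanced(text) {
  let depth = 0; // number of currently open "("
  for (const ch of text) {
    if (ch === "(") {
      depth += 1;
    } else if (ch === ")") {
      if (depth === 0) return false; // too many or misplaced ")"
      depth -= 1;
    }
  }
  return depth === 0; // anything left open means too many or misplaced "("
}

console.log(parensBalanced("(apple + (-0.5)) * (boy - 1)"));
console.log(parensBalanced("a) * (b + c) / (d"));
```

A counter is enough here because there is only one kind of bracket; a stack becomes necessary once several bracket types have to be matched against each other.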
Regular expressions can only be used to recognize regular languages. The language of mathematical expressions is not regular; you'll need to implement an actual parser (e.g. LR) in order to do this.
Matching parens with a regex is quite possible.
Here is a Perl script that will parse arbitrarily deep matching parens. While it will throw out the non-matching parens outside, I did not design it specifically to validate parens. It will parse arbitrarily deep parens so long as they are balanced. This will get you started, however.
The key is recursion, both in the regex and in the use of it. Play with it, and I am sure that you can get this to also flag non-matching parens. I think if you capture what this regex throws away and count parens (i.e. test for odd parens in the non-match text), you have invalid, unbalanced parens.
#!/usr/bin/perl
$re = qr /
( # start capture buffer 1
\( # match an opening paren
( # capture buffer 2
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^()]+ # one or more
) # end non backtracking group
| # ... or ...
(?1) # recurse to opening 1 and try it again
)* # 0 or more times.
) # end of buffer 2
\) # match a closing paren
) # end capture buffer one
/x;
sub strip {
    my ($str) = @_;
    while ($str=~/$re/g) {
        $match=$1; $striped=$2;
        print "$match\n";
        strip($striped) if $striped=~/\(/;
        return $striped;
    }
}
while(<DATA>) {
    print "start pattern: $_";
    while (/$re/g) {
        strip($1) ;
    }
}
__DATA__
"(apple + (-0.5)) * (boy - 1)"
"((((one)two)three)four)x(one(two(three(four))))"
"a) * (b + c) / (d"
"-a * (b / 1.50)"
Output:
start pattern: "(apple + (-0.5)) * (boy - 1)"
(apple + (-0.5))
(-0.5)
(boy - 1)
start pattern: "((((one)two)three)four)x(one(two(three(four))))"
((((one)two)three)four)
(((one)two)three)
((one)two)
(one)
(one(two(three(four))))
(two(three(four)))
(three(four))
(four)
start pattern: "a) * (b + c) / (d"
(b + c)
start pattern: "-a * (b / 1.50)"
(b / 1.50)
I believe you will be better off implementing a real parser to accomplish what you're after.
A parser for simple mathematical expressions is "Parsing 101", and there are several examples to be found online.
Some examples include:
ANTLR: Expression Evaluator Sample (ANTLR grammars can target several languages)
pyparsing: http://pyparsing.wikispaces.com/file/view/fourFn.py (pyparsing is a Python library)
Lex & Yacc: http://epaperpress.com/lexandyacc/ (contains a PDF tutorial and sample code for a calculator)
Note that the grammar you will need for validating expressions is simpler than the examples above, since the examples also implement evaluation of the expression.
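To give an idea of how small a validation-only parser can be, here is a hedged JavaScript sketch of a recursive-descent validator. The grammar is an assumption inferred from the examples in the question: a sign is allowed only at the start of an expression or right after an opening parenthesis, and the operators are + - * /. The function name isValidExpression is hypothetical.

```javascript
// Hypothetical validator: returns true when the input is a well-formed expression.
// Assumed grammar:
//   expr := [+-]? atom (op atom)*
//   atom := number | identifier | "(" expr ")"
function isValidExpression(input) {
  // Tokenize: numbers, identifiers, operators and parentheses; whitespace is skipped.
  const tokenPattern = /\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|([-+*\/()]))/y;
  const text = input.trimEnd();
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    tokenPattern.lastIndex = pos;
    const m = tokenPattern.exec(text);
    if (!m) return false; // unknown character, e.g. "#" or a stray "."
    tokens.push(m[1] !== undefined ? "num" : m[2] !== undefined ? "id" : m[3]);
    pos = tokenPattern.lastIndex;
  }

  let i = 0; // index of the next unread token

  function parseAtom() {
    if (tokens[i] === "num" || tokens[i] === "id") {
      i += 1;
      return true;
    }
    if (tokens[i] === "(") {
      i += 1;
      if (!parseExpr() || tokens[i] !== ")") return false;
      i += 1;
      return true;
    }
    return false;
  }

  function parseExpr() {
    if (tokens[i] === "+" || tokens[i] === "-") i += 1; // optional leading sign
    if (!parseAtom()) return false;
    while (["+", "-", "*", "/"].includes(tokens[i])) {
      i += 1;
      if (!parseAtom()) return false;
    }
    return true;
  }

  // Valid only if one expression consumes every token.
  return parseExpr() && i === tokens.length;
}

console.log(isValidExpression("(apple + (-0.5)) * (boy - 1)"));
console.log(isValidExpression("a) * (b + c) / (d"));
```

Unmatched parentheses fall out naturally: a stray ")" is left unconsumed at the end, and a missing ")" makes the atom rule fail.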
You can't use regex to do things like balance parentheses.
This is tricky with one single regular expression, but quite easy using a mixed regexp/procedural approach. The idea is to construct a regexp for the simple expression (without parentheses) and then repeatedly replace ( simple-expression ) with some atomic string (e.g. an identifier). If the final reduced expression matches the same `simple' pattern, the original expression is considered valid.
Illustration (in PHP):
function check_syntax($str) {
// define the grammar
$number = "\d+(\.\d+)?";
$ident = "[a-z]\w*";
$atom = "[+-]?($number|$ident)";
$op = "[+*/-]";
$sexpr = "$atom($op$atom)*"; // simple expression
// step1. remove whitespace
$str = preg_replace('~\s+~', '', $str);
// step2. repeatedly replace parenthetic expressions with 'x'
$par = "~\($sexpr\)~";
while(preg_match($par, $str))
$str = preg_replace($par, 'x', $str);
// step3. no more parens, the string must be simple expression
return preg_match("~^$sexpr$~", $str);
}
$tests = array(
"a * b + c",
"-a * (b / 1.50)",
"(apple + (-0.5)) * (boy - 1)",
"--a *+ b # 1.5.0",
"-a * b + 1)",
"a) * (b + c) / (d",
);
foreach($tests as $t)
echo $t, "=", check_syntax($t) ? "ok" : "nope", "\n";
The above only validates the syntax, but the same technique can be also used to construct a real parser.
For parenthesis matching, and implementing other expression validation rules, it is probably easiest to write your own little parser. Regular expressions are no good in this kind of situation.
OK, here's my version of parenthesis finding in ActionScript 3. This approach gives you a lot of traction for analysing the part before the parentheses, the part inside them, and the part after them. If some parentheses remain at the end, you can raise a warning or refuse to send the result to a final eval function.
package {
import flash.display.Sprite;
import mx.utils.StringUtil;
public class Stackoverflow_As3RegexpExample extends Sprite
{
private var tokenChain:String = "2+(3-4*(4/6))-9(82+-21)"
//Constructor
public function Stackoverflow_As3RegexpExample() {
// remove the "\" that just escape the following "\" if you want to test outside of flash compiler.
var getGroup:RegExp = new RegExp("((?:[^\\(\\)]+)?) (?:\\() ( (?:[^\\(\\)]+)? ) (?:\\)) ((?:[^\\(\\)]+)?)", "ix") //removed g flag
while (true) {
tokenChain = replace(tokenChain,getGroup)
if (tokenChain.search(getGroup) == -1) break;
}
trace("cummulativeEvaluable="+cummulativeEvaluable)
}
private var cummulativeEvaluable:Array = new Array()
protected function analyseGrammar(matchedSubstring:String, capturedMatch1:String, capturedMatch2:String, capturedMatch3:String, index:int, str:String):String {
trace("\nanalyseGrammar str:\t\t\t\t'"+str+"'")
trace("analyseGrammar matchedSubstring:'"+matchedSubstring+"'")
trace("analyseGrammar capturedMatchs:\t'"+capturedMatch1+"' '("+capturedMatch2+")' '"+capturedMatch3+"'")
trace("analyseGrammar index:\t\t\t'"+index+"'")
var blank:String = buildBlank(matchedSubstring.length)
cummulativeEvaluable.push(StringUtil.trim(matchedSubstring))
// I could do soo much rigth here!
return str.substr(0,index)+blank+str.substr(index+matchedSubstring.length,str.length-1)
}
private function replace(str:String,regExp:RegExp):String {
var result:Object = regExp.exec(str)
if (result)
return analyseGrammar.apply(null,objectToArray(result))
return str
}
private function objectToArray(value:Object):Array {
var array:Array = new Array()
var i:int = 0
while (true) {
if (value.hasOwnProperty(i.toString())) {
array.push(value[i])
} else {
break;
}
i++
}
array.push(value.index)
array.push(value.input)
return array
}
protected function buildBlank(length:uint):String {
var blank:String = ""
while (blank.length != length)
blank = blank+" "
return blank
}
}
}
It should trace this:
analyseGrammar str: '2+(3-4*(4/6))-9(82+-21)'
analyseGrammar matchedSubstring:'3-4*(4/6)'
analyseGrammar capturedMatchs: '3-4*' '(4/6)' ''
analyseGrammar index: '3'
analyseGrammar str: '2+( )-9(82+-21)'
analyseGrammar matchedSubstring:'2+( )-9'
analyseGrammar capturedMatchs: '2+' '( )' '-9'
analyseGrammar index: '0'
analyseGrammar str: ' (82+-21)'
analyseGrammar matchedSubstring:' (82+-21)'
analyseGrammar capturedMatchs: ' ' '(82+-21)' ''
analyseGrammar index: '0'
cummulativeEvaluable=3-4*(4/6),2+( )-9,(82+-21)