Regex format is different in AEM Dispatcher

When we write a regex that contains a forward slash, we need to put a backslash before the forward slash, since the forward slash is the regex delimiter. For example, if I want my regex to match /content/att, then I need to write the regex like this:
/content\/att. And this works too.
But when we add a dispatcher rule in AEM to allow a URL path, the backslash is not needed before the forward slash. I would appreciate it if someone could help me understand this: why do we need the backslash when we write the regex, but not when we use the same regex in the URL path of the dispatcher rule?
In the dispatcher rule, look at the URL path: there is no backslash before /att
/type "allow"
/url "/content/att"
/extension '(gif)'
}

I am not familiar with the AEM Dispatcher, but here is the generic answer to your regular expression question:
That is because "/content/att" is the string representation of the regex. The actual regex of that is "/\/content\/att/". Notice that the initial slash in the string is also escaped.
Here is an example: These two JavaScript regular expressions are identical:
let regex1 = /^\/content\/att/;
let regex2 = new RegExp( "^/content/att" );

Short answer: because these are two different types of regex representations.
Long answer:
Historically, regexes first appeared in text editors like QED and ed. There, regexes were used for string substitutions (search and replace). The tools needed some way to distinguish the search regex from the replacement string, which is why we got the delimiter. A command to replace some text in ed, for example, would be s«DELIMITER»search-regex«DELIMITER»substitution-string«DELIMITER»flags.
Most single-char delimiters would work but / was often chosen. Of course, it was possible to use the delimiter as part of the regex or the substitution, in which case it would have to be escaped using backslash.
Some programming languages have codified / as the de facto standard delimiter for regex literals. JavaScript is an example of this.
Now, usages that have no need for a regex to be separated from the substitution (or to allow for regex flags) don’t use delimiters at all. Such is the case in Java, where there are no regex literals; regexes are always created from a string using the Pattern class. This is why, in AEM, you don’t need to escape the /.
You didn’t show us your Apache dispatcher config file, so I’m not sure where you’re escaping the slash there. I know that Apache’s mod_rewrite also doesn’t use delimited regexes.
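To illustrate the difference between a delimited regex literal and a regex built from a string, here is a minimal sketch in Python, which, like Java, has no regex literals and always builds patterns from strings. The sample path is made up for illustration and is not taken from any dispatcher configuration.

```python
import re

# Python has no /.../ regex literals, so "/" is an ordinary character
# and can be written unescaped in the pattern string.
plain = re.compile("^/content/att")

# Escaping the slash is tolerated but redundant: "\/" still means "/".
escaped = re.compile(r"^\/content\/att")

sample_path = "/content/att/logo.gif"  # hypothetical URL path

print(bool(plain.search(sample_path)))    # True
print(bool(escaped.search(sample_path)))  # True
```

Both patterns behave identically; the backslash only matters in syntaxes where / also marks the start and end of the regex.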

Related

How to properly escape Regular Expression pattern in XSD schema?

I need to fulfill a requirement to only accept values in the form of MM/DD/YYYY.
From what I've read on: https://www.w3.org/TR/xmlschema11-2/#nt-dateRep
Using
<xs:simpleType name="DATE">
<xs:restriction base="xs:date"/>
</xs:simpleType>
is not going to work, as its regex apparently does not support this format.
I have found and adjusted this format:
^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
To this form:
\^\(\?:\(\?:\(\?:0\?\[13578\]\|1\[02\]\)\(\\/\)31\)\1\|\(\?:\(\?:0\?\[1,3-9\]\|1\[0-2\]\)\(\\/\)\(\?:29\|30\)\2\)\)\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\d{2}\)$\|\^\(\?:0\?2\(\\/\)29\3\(\?:\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\(\?:0\[48\]\|\[2468\]\[048\]\|\[13579\]\[26\]\)\|\(\?:\(\?:16\|\[2468\]\[048\]\|\[3579\]\[26\]\)00\)\)\)\)$\|\^\(\?:\(\?:0\?\[1-9\]\)\|\(\?:1\[0-2\]\)\)\(\\/\)\(\?:0\?\[1-9\]\|1\d\|2\[0-8\]\)\4\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\d{2}\)$
Now I no longer get invalid escaping errors in XML editors (using XML Spy), but I get this one:
invalid-escape: The given character escape is not recognized.
I have done the escape according to the XML schema specifications here:
https://www.w3.org/TR/xmlschema-2/#regexs Section F.1.1 there is an escape table.
Can anyone please help to nail this down right?
Thanks!
If you check the XSD regex syntax resources, you will notice that there is no support for non-capturing groups ((?:...)), nor backreferences (the \n like entities to refer to the text captured with capturing groups, (...)).
Since the only delimiter is /, you can get rid of the backreference completely.
Use
((((0?[13578]|1[02])/31)/|((0?[13-9]|1[0-2])/(29|30)/))((1[6-9]|[2-9]\d)?\d{2})|(0?2/29/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1[0-2])/(0?[1-9]|1\d|2[0-8])/(1[6-9]|[2-9]\d)?\d{2})
Note that, according to regular-expressions.info:
Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries, and lookaround. XML schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to be considered valid.
So, you should not use ^ (start of string) and $ (end of string) in XSD regex.
The / symbol is escaped in regex flavors where it is a regex delimiter, and in XSD regex, there are no regex delimiters (as the only action is matching, and there are no modifiers: XML schemas do not provide a way to specify matching modes). So, do not escape / in XSD regex.
TESTING AT ONLINE TESTERS NOTE
If you test at regex101.com or similar sites, note that in most cases you need to escape the / if it is selected as a regex delimiter. You can safely remove the \ before / after you finished testing.
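The implicit anchoring described above can be approximated outside an XSD validator. The following Python sketch works under stated assumptions: re.fullmatch stands in for XSD's whole-value matching, and the helper name is_valid_date and the sample dates are made up for illustration. Python's re module is not the XSD regex flavor, so this checks the logic of the pattern, not whether a schema processor accepts it.

```python
import re

# The pattern from the answer above: no ^, no $, no escaped slashes.
XSD_STYLE_DATE = (
    r"((((0?[13578]|1[02])/31)/|((0?[13-9]|1[0-2])/(29|30)/))"
    r"((1[6-9]|[2-9]\d)?\d{2})"
    r"|(0?2/29/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])"
    r"|((16|[2468][048]|[3579][26])00))))"
    r"|(0?[1-9]|1[0-2])/(0?[1-9]|1\d|2[0-8])/(1[6-9]|[2-9]\d)?\d{2})"
)


def is_valid_date(value):
    """Return True if the whole value matches, as an XSD pattern facet would require."""
    return re.fullmatch(XSD_STYLE_DATE, value) is not None


for candidate in ["12/31/1999", "02/29/2024", "02/29/2023", "04/31/2020"]:
    print(candidate, is_valid_date(candidate))
```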
OK, so you're starting from this (I'm going to insert newlines for readability):
^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)
(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|
^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
Horrendous stuff. Now, in XSD:
(a) there are no ^ and $ anchors, they aren't needed (the pattern is implicitly anchored). So take them out. You've responded by escaping them as \^ and \$ but that doesn't make sense: you don't actually want circumflexes and dollar signs in your input.
(b) XSD doesn't recognize non-capturing groups (?:xxxx). Just replace them with capturing groups - that is, remove the ?: Again, you've escaped the question marks, which doesn't make any sense at all.
(c) The \d should probably be [0-9], unless you actually want to match non-ASCII digits (e.g. Thai or Eastern Arabic digits)
(d) Slash (/) doesn't need to be escaped, and indeed can't be escaped. So replace \/ with /.
(e) I see some back-references, \1, \2, \3, \4. XSD regexes do not allow back-references. But as far as I can see, the back-references in this regex serve no useful purpose. Each of them refers to a group of the form (\/), which can only match a single slash, so a back-reference such as \1 can simply be replaced with /. Maybe they are throwbacks to some earlier form of the regex that allowed alternative delimiters but required them to be consistent.
From your attempts to fix the problems here, it seems to me that you don't have a very thorough understanding of regular expressions. I fear that to get this working, you are going to have to bite the bullet and learn how it works; debugging complex regular expressions is difficult, and you won't get it right by trial and error.
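Steps (a) to (e) can be sketched as plain text substitutions. This is a naive illustration in Python that only works because this particular regex has no ^ or $ inside character classes and every back-reference points at a (\/) group; the function name to_xsd_style is hypothetical, and a general converter would need a real regex parser.

```python
import re

ORIGINAL = (
    r"^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])"
    r"|(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)


def to_xsd_style(pattern):
    """Apply steps (a)-(e) as simple string edits (valid for this pattern only)."""
    pattern = pattern.replace("^", "").replace("$", "")  # (a) drop anchors
    pattern = pattern.replace("(?:", "(")                # (b) plain groups
    pattern = pattern.replace(r"\d", "[0-9]")            # (c) ASCII digits only
    pattern = pattern.replace(r"\/", "/")                # (d) unescape slashes
    pattern = re.sub(r"\\[1-9]", "/", pattern)           # (e) back-references -> "/"
    return pattern


converted = to_xsd_style(ORIGINAL)
print(converted)

# Sanity check: the converted pattern, matched against the whole value,
# should agree with the original on a few sample dates.
samples = ["12/31/1999", "02/29/2024", "02/29/2023", "04/31/2020", "1/5/20"]
for s in samples:
    print(s, re.fullmatch(converted, s) is not None)
```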

How to escape regular expression characters from variable in JMeter?

The problem is simple. I have a regular expression used to extract some data from a response. It looks like this:
<input type="hidden" +name="reportpreset_id" +value="(\w+)" *>${reportPresetName}</td>
The problem is that the variable ${reportPresetName} may contain characters that are special in regular expressions, like parentheses or dots.
I've tried to surround this variable with \Q and \E, but apparently these markers don't work (Java supports these markers, so I'm confused).
When I add those markers, the expression fails for any content of the ${reportPresetName} variable (even for cases where it worked without the markers).
I've checked the list of functions in JMeter, but I didn't find anything useful.
Does anyone know how to escape regular expression characters in JMeter?
update:
When I use \Q and \E in an assertion, it fails. When I copy the regular expression from the assertion log in "View Results Tree" and test it on the recorded response data, it works! So it looks like some kind of bug in JMeter.
JMeter uses Jakarta ORO as its regexp engine in the Regexp Extractor and Regexp Tester:
http://jmeter.apache.org/usermanual/regular_expressions.html
But it uses the Java regexp engine for search in the HTML/Text viewer.
Read:
http://jmeter.apache.org/usermanual/regular_expressions.html section 20.4
Please note that ORO does not support the \Q and \E meta-characters.
[In other RE engines, these can be used to quote a portion of an RE so that the
meta-characters stand for themselves.]
A solution for you would be to add a JSR223 post processor using Groovy after regexp that extracts the var and escapes regexp chars using:
org.apache.oro.text.regex.Perl5Compiler.quotemeta(String valueToEscape)
As of upcoming version 2.9, a new function has been created to do so:
__escapeOroRegexpChars(String to escape, Variable Name)
\Q and \E work in Java, see Pattern.
In Java string literals, backslashes have to be doubled, though, so you might need to use (\\w+) and, of course, \\Q and \\E.
I am not sure about your case, as I don't really understand your context (I have never used JMeter).
In case JMeter does not support \\Q and \\E (I don't know whether it does), you can write your own function/procedure that splits the string into characters and replaces each character with an escaped sequence as follows:
if the character is \, then replace it with \\\\
otherwise add before the character a prefix \\
This is not the optimal method, but for sure it will work as needed.
For example for input
This is-a\string 12&$34|!`^5
you will get
\\T\\h\\i\\s\\ \\i\\s\\-\\a\\\\\\s\\t\\r\\i\\n\\g\\ \\1\\2\\&\\$\\3\\4\\|\\!\\`\\^\\5
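The general idea of escaping a variable before embedding it in a pattern can be sketched in Python. This is not JMeter or ORO code: the response text and the variable value are hypothetical, and re.escape plays the role that a quoting function would play in JMeter. Note that re.escape leaves letters and digits alone, because in most engines a backslash before an alphanumeric character changes its meaning (\s is whitespace, \1 is a back-reference) rather than making it literal.

```python
import re

# Hypothetical data standing in for the sampled response and the JMeter variable.
response = '<input type="hidden" name="reportpreset_id" value="abc123" >Report (v1.2)</td>'
report_preset_name = "Report (v1.2)"

prefix = r'<input type="hidden" +name="reportpreset_id" +value="(\w+)" *>'
suffix = "</td>"

# Unescaped: "(" and ")" form a group and "." matches any character,
# so the literal text "Report (v1.2)" is not found.
unescaped = re.search(prefix + report_preset_name + suffix, response)

# Escaped: every metacharacter in the variable is taken literally.
escaped = re.search(prefix + re.escape(report_preset_name) + suffix, response)

print(unescaped)          # None
print(escaped.group(1))   # abc123

# Why letters must not be prefixed: "\s" no longer means the letter "s".
print(re.fullmatch(r"\s", "s"))  # None
```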

Check string for email with regular expressions or other way

I've tried the following code, but it gives me nomatch.
re:run("qw#qc.com", "\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b").
I got the regexp here: http://www.regular-expressions.info/email.html
EDITED:
The following doesn't work either:
re:run("345345", "\b[0-9]+\b").
If the string contains only an email address, then this one will match:
re:run("qw#qc.com", "^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}$").
I hesitate to answer this question, since I believe it relies on an incorrect assumption - that you can determine whether an email address is valid or not with a regular expression. See this question for more details; from a short glance I'd note that the regexp in your question doesn't accept the .museum and .рф top-level domains.
That said, you need to escape the backslashes. You want the string to contain backslashes, but in Erlang, backslashes are used inside strings to escape various characters, so any literal backslash needs to be written as \\. Try this:
3> re:run("qw#qc.com", "\\b[a-z0-9._%+-]+#[a-z0-9.-]+\\.[a-z]{2,4}\\b").
{match,[{0,9}]}
Or even better, this:
8> re:run("qw#qc.com", "\\b[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+#[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)*\\b").
{match,[{0,9}]}
That's the regexp used in the HTML 5 standard, modified to use \\b instead of ^ and $.
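The same pitfall exists in other languages whose string literals use backslash escapes. As a minimal sketch, here it is in Python rather than Erlang: in an ordinary string literal, "\b" is the backspace character, so the regex engine never sees a word-boundary assertion unless the backslash is doubled or a raw string is used.

```python
import re

text = "345345"

# In a normal string literal "\b" is the backspace character (0x08),
# so this pattern looks for literal backspaces around the digits.
broken = re.search("\b[0-9]+\b", text)

# Doubling the backslash (or using a raw string) passes "\b" through intact.
doubled = re.search("\\b[0-9]+\\b", text)
raw = re.search(r"\b[0-9]+\b", text)

print(broken)          # None
print(doubled.span())  # (0, 6)
print(raw.span())      # (0, 6)
```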
Looks like you need a case-insensitive match.
Currently [A-Z0-9._%+-] (for example) only matches upper-case letters (plus digits etc.).
One solution is to specify [A-Za-z]. Another solution is to convert your email address to uppercase prior to matching.
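Both suggestions, plus a case-insensitive flag, can be compared side by side. The sketch below uses Python's re.IGNORECASE for illustration only; Erlang's re module has a comparable caseless option. The pattern is the one from the question, with # kept in place of @ as in the original post.

```python
import re

pattern = r"[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}"
address = "qw#qc.com"

# Upper-case-only classes do not match a lower-case address.
print(re.fullmatch(pattern, address))                       # None

# Option 1: ask the engine to ignore case.
print(bool(re.fullmatch(pattern, address, re.IGNORECASE)))  # True

# Option 2: normalise the input before matching.
print(bool(re.fullmatch(pattern, address.upper())))         # True
```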

Regular expression with drools

I have a multiline string, as below.
rawMessage=sysUpTimeInstance-->0:0:00:05.00
snmpTrapOID.0-->linkDown.0.0
In the when portion of the Drools rule, I have written the condition as below.
rawMessage matches "(?i).*linkDown(.|\n|\r)*"
But it is not working. Please give me some pointers on how to handle multiline input.
It's not clear to me what you want to achieve. Your regex does not look wrong (I don't know the Drools flavour or what you want to match).
In general, (.|\n|\r)* is able to match any character, including newlines. In your example there is no newline after "linkDown", so what should it match there?
Maybe you need to double-escape (I don't know about Drools), like this: (.|\\n|\\r)*.
Another possibility is to use the single-line modifier s (again, I don't know if Drools supports this modifier). This makes the . also match newline characters. It could then look something like this:
rawMessage matches "(?i)(?s).*linkDown.*"
or if it should only match multiline from "linkdown" on
rawMessage matches "(?i).*linkDown(?s).*"
Drools uses standard Java regular expressions. As the previous answer mentions, your expression looks wrong. And yes, you need to double-escape special characters as you would in Java. Just check the Javadoc for the Pattern class in the Java API.
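The effect of the s modifier can be checked with a small Python sketch. It assumes that the Drools matches operator requires the whole string to match, as Java's String.matches does, so re.fullmatch is used as a stand-in. Under that assumption, the original pattern fails because the leading .* cannot cross the newline in front of the line that contains linkDown.

```python
import re

raw_message = "sysUpTimeInstance-->0:0:00:05.00\nsnmpTrapOID.0-->linkDown.0.0"

# Without the s flag, the leading ".*" stops at the first newline,
# so "linkDown" on the second line is never reached.
original = re.fullmatch(r"(?i).*linkDown(.|\n|\r)*", raw_message)

# With the s (dot-all) flag, "." also matches newlines.
dotall = re.fullmatch(r"(?is).*linkDown.*", raw_message)

print(original is not None)  # False
print(dotall is not None)    # True
```

Here (?is) combines the case-insensitive and single-line flags in one group at the start of the pattern.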

exact meaning of tag filtering regex

The following regular expressions filter the style/src attributes of some HTML tags.
[(?i:s\\*c\\*r\\*i\\*p\\*t)]
[(?i:e\\*x\\*p\\*r\\*e\\*s\\*s\\*i\\*o\\*n)]
Besides 'modifier span',
what is "\\*"?
Does it mean s*c*r*i*p*t ? Then, does it have any effect to filtering?
In regex, \\* means 0 or more literal \ characters. So the regexes are looking for the words script and expression, possibly with any number of backslashes between the letters, and possibly with no backslashes at all.
Some examples that would match:
s\c\r\\ipt
sc\\\\\ript
s\\\c\r\\\ip\\\t
script
As Qtax points out, the language is going to be important here. I don't recognize that regex syntax, but some require backslashes to be double-escaped: once for the primary language, and once for the regex engine. That's a hard thing to explain, but basically it means that the patterns might only match the following two inputs, depending on the programming language:
s*c*r*i*p*t
e*x*p*r*e*s*s*i*o*n
Generally, a \ character in a regex escapes special characters to suppress their special meaning, i.e. \\n would actually equate to a literal \n instead of a newline.
Simple as that!
Just to add to the answer, the characters in question would resolve to s\*c\*r\*i\*p\*t
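The two readings discussed above can be compared directly. In this Python sketch, raw strings are used so that the regex engine receives exactly the characters shown. Which reading applies to the original filter depends on whether its host language unescapes the string first, which is not known here, and the surrounding brackets and the (?i:...) span are left out to keep the comparison focused on \\*.

```python
import re

# Reading 1: the engine sees \\* , i.e. "zero or more literal backslashes".
engine_sees_double = r"s\\*c\\*r\\*i\\*p\\*t"

# Reading 2: the host language unescapes first, so the engine sees \* ,
# i.e. "a literal asterisk".
engine_sees_single = r"s\*c\*r\*i\*p\*t"

for candidate in ["script", r"s\c\r\\ipt", "s*c*r*i*p*t"]:
    print(
        candidate,
        re.fullmatch(engine_sees_double, candidate) is not None,
        re.fullmatch(engine_sees_single, candidate) is not None,
    )
```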