How to properly escape Regular Expression pattern in XSD schema? - regex

I need to fulfill a requirement to only accept values in the form of MM/DD/YYYY.
From what I've read on: https://www.w3.org/TR/xmlschema11-2/#nt-dateRep
Using
<xs:simpleType name="DATE">
<xs:restriction base="xs:date"/>
</xs:simpleType>
is not going to work, because the xs:date lexical format apparently does not support MM/DD/YYYY.
I have found and adjusted this format:
^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
To this form:
\^\(\?:\(\?:\(\?:0\?\[13578\]\|1\[02\]\)\(\\/\)31\)\1\|\(\?:\(\?:0\?\[1,3-9\]\|1\[0-2\]\)\(\\/\)\(\?:29\|30\)\2\)\)\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\d{2}\)$\|\^\(\?:0\?2\(\\/\)29\3\(\?:\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\(\?:0\[48\]\|\[2468\]\[048\]\|\[13579\]\[26\]\)\|\(\?:\(\?:16\|\[2468\]\[048\]\|\[3579\]\[26\]\)00\)\)\)\)$\|\^\(\?:\(\?:0\?\[1-9\]\)\|\(\?:1\[0-2\]\)\)\(\\/\)\(\?:0\?\[1-9\]\|1\d\|2\[0-8\]\)\4\(\?:\(\?:1\[6-9\]\|\[2-9\]\d\)\?\d{2}\)$
Now I no longer get invalid escaping errors in XML editors (using XML Spy), but I get this one:
invalid-escape: The given character escape is not recognized.
I have done the escape according to the XML schema specifications here:
https://www.w3.org/TR/xmlschema-2/#regexs Section F.1.1 there is an escape table.
Can anyone please help to nail this down right?
Thanks!

If you check the XSD regex syntax resources, you will notice that it supports neither non-capturing groups, (?:...), nor backreferences (the \1, \2, ... entities that refer to the text captured by capturing groups, (...)).
Since the only delimiter is /, you can get rid of the backreferences completely.
Use
((((0?[13578]|1[02])/31)/|((0?[13-9]|1[0-2])/(29|30)/))((1[6-9]|[2-9]\d)?\d{2})|(0?2/29/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))|(0?[1-9]|1[0-2])/(0?[1-9]|1\d|2[0-8])/(1[6-9]|[2-9]\d)?\d{2})
Note that, according to regular-expressions.info:
Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries, and lookaround. XML schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to be considered valid.
So, you should not use ^ (start of string) and $ (end of string) in XSD regex.
The / symbol is escaped in regex flavors where it is a regex delimiter, and in XSD regex, there are no regex delimiters (as the only action is matching, and there are no modifiers: XML schemas do not provide a way to specify matching modes). So, do not escape / in XSD regex.
NOTE ON TESTING WITH ONLINE TESTERS
If you test at regex101.com or similar sites, note that in most cases you need to escape the / when it is selected as the regex delimiter. You can safely remove the \ before / once you have finished testing.
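As a rough way to sanity-check the pattern above outside an XSD validator, here is a minimal Python sketch. It assumes Python's re module is an acceptable stand-in for this particular pattern (the pattern uses only syntax that both flavors share), and it relies on re.fullmatch to imitate the implicit anchoring that XSD applies. The names XSD_DATE_PATTERN and is_valid_mmddyyyy are made up for illustration.

```python
import re

# The pattern from the answer above: no ^ or $, no (?:...), no backreferences.
# Adjacent raw strings are concatenated into one pattern.
XSD_DATE_PATTERN = (
    r"((((0?[13578]|1[02])/31)/|((0?[13-9]|1[0-2])/(29|30)/))"
    r"((1[6-9]|[2-9]\d)?\d{2})"
    r"|(0?2/29/(((1[6-9]|[2-9]\d)?(0[48]|[2468][048]|[13579][26])"
    r"|((16|[2468][048]|[3579][26])00))))"
    r"|(0?[1-9]|1[0-2])/(0?[1-9]|1\d|2[0-8])/(1[6-9]|[2-9]\d)?\d{2})"
)


def is_valid_mmddyyyy(value: str) -> bool:
    # fullmatch requires the whole string to match, like an XSD pattern facet.
    return re.fullmatch(XSD_DATE_PATTERN, value) is not None


for sample in ["12/31/1999", "02/29/2020", "02/29/2019", "04/31/2000"]:
    print(sample, is_valid_mmddyyyy(sample))
```

The first two samples should be accepted and the last two rejected (2019 is not a leap year, and April has no 31st).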

OK, so you're starting from this (I'm going to insert newlines for readability):
^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)
(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|
^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
Horrendous stuff. Now, in XSD:
(a) there are no ^ and $ anchors, they aren't needed (the pattern is implicitly anchored). So take them out. You've responded by escaping them as \^ and \$ but that doesn't make sense: you don't actually want circumflexes and dollar signs in your input.
(b) XSD doesn't recognize non-capturing groups (?:xxxx). Just replace them with capturing groups - that is, remove the ?: Again, you've escaped the question marks, which doesn't make any sense at all.
(c) The \d should probably be [0-9], unless you actually want to match non-ASCII digits (e.g. Thai or Eastern Arabic digits)
(d) Slash (/) doesn't need to be escaped, and indeed can't be escaped. So replace \/ with /.
(e) I see some back-references, \1, \2, \4. XSD regexes do not allow back-references. But as far as I can see, the back-references in this regex serve no useful purpose. Most of them seem to be back-references to a group of the form (\/) which can only match a single slash, so the back-reference \1 can be simply replaced with /. Maybe they are throwbacks to some earlier form of the regex that allowed alternative delimiters but required them to be consistent.
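Steps (a) through (e) can be sketched as a handful of string replacements. This is a hypothetical Python helper (to_xsd_pattern is not part of any library), and the replacements are deliberately naive: they are only safe for this specific pattern, which has no ^ inside character classes and no other escapes. The stray comma in [1,3-9] is carried over unchanged.

```python
import re

# The original pattern from the question, split across raw strings.
ORIGINAL = (
    r"^(?:(?:(?:0?[13578]|1[02])(\/)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/)(?:29|30)\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])"
    r"|(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)


def to_xsd_pattern(pattern: str) -> str:
    pattern = pattern.replace("^", "").replace("$", "")  # (a) drop the anchors
    pattern = pattern.replace("(?:", "(")                # (b) plain groups only
    pattern = pattern.replace(r"(\/)", "/")              # (d) unescaped slash, group not needed
    pattern = re.sub(r"\\[1-9]", "/", pattern)           # (e) back-references become a literal /
    pattern = pattern.replace(r"\d", "[0-9]")            # (c) ASCII digits only
    return pattern


xsd_pattern = to_xsd_pattern(ORIGINAL)
print(xsd_pattern)
# fullmatch stands in for XSD's implicit anchoring.
print(bool(re.fullmatch(xsd_pattern, "02/29/2020")))
print(bool(re.fullmatch(xsd_pattern, "02/29/2019")))
```

The resulting string contains no backslashes, anchors or (?: sequences, so it should be acceptable as the value of an xs:pattern facet.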
From your attempts to fix the problems here, it seems to me that you don't have a very thorough understanding of regular expressions. I fear that to get this working, you are going to have to bite the bullet and learn how it works; debugging complex regular expressions is difficult, and you won't get it right by trial and error.

Related

Regex: extract characters from two patterns

I have the following strings:
https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...
https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...
I'd like to extract jbkwe3.ho4eho.8495GETD or abcde2.hopeho.3345GETD. Anything between the {weekday}/ and the ?weatherType=.
I've tried (?<=sunday\/)$.*?(?=\?weatherType=) but it only works for the first line, and I want it to apply to all strings regardless of the value of {weekday}.
I tried (?<=\/.*\/)$.*?(?=\?weatherType=) but it didn't work. Could anyone familiar with regex lend some help? Thank you!
[Update]
I'm new to regex, but I was experimenting with it in the Sublime Text editor via the "find" functionality, which I think should be PCRE (according to this post)
Try this regex:
(?:sun|mon|tues|wednes|thurs|fri|satur)day\/\K[^?]+(?=\?weatherType)
Explanation:
(?:sun|mon|tues|wednes|thurs|fri|satur)day - matches the day of a week i.e, sunday,monday,tuesday,wednesday,thursday,friday,saturday
\/ - matches /
\K - discards whatever has been matched so far and makes the reported match start from the current position. \K is available in PCRE.
[^?]+ - matches 1 or more occurrences of any character that is not a ?
(?=\?weatherType) - a positive lookahead: the subpattern [^?]+ above matches all characters that are not ? until it reaches a position that is immediately followed by a ? and then weatherType
To make the match case-insensitive, you can prepend the regex with (?i) as shown here
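For engines without \K, the same idea can be expressed with a capturing group. Below is a minimal Python sketch (Python's built-in re module does not support \K); extract_segment is a hypothetical helper name, and the sample URLs are the ones from the question.

```python
import re
from typing import Optional

# A capturing group takes the place of "\K[^?]+".
PATTERN = re.compile(
    r"(?:sun|mon|tues|wednes|thurs|fri|satur)day/([^?]+)(?=\?weatherType)",
    re.IGNORECASE,
)


def extract_segment(url: str) -> Optional[str]:
    match = PATTERN.search(url)
    return match.group(1) if match else None


urls = [
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...",
    "https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...",
]
for url in urls:
    print(extract_segment(url))
```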
In the examples given, you actually only need to grab the characters between the last forward slash ("/") and the first question mark ("?").
You didn't mention which regex flavor (e.g. PCRE, grep, Oracle) you're using, and the actual syntax will vary depending on this, but in general, something like the following (Perl) replacement regex would handle the examples given:
s/.*\/([^?]*)\?.*/$1/gm
There are other (and more efficient) ways, but this will do the job.
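For comparison, a rough Python equivalent of the Perl substitution above might look like this. It is only a sketch: last_path_segment is a made-up name, and a string without a ? is returned unchanged because the pattern does not match it.

```python
import re


def last_path_segment(url: str) -> str:
    # The greedy .* backtracks to the last "/" that is followed by
    # non-"?" characters and then a "?"; group 1 holds those characters.
    return re.sub(r".*/([^?]*)\?.*", r"\1", url)


print(last_path_segment(
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&..."
))
```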

Regex Match Commas Outside Quotes - XML Schema Variant

At first glance, this looks like a common question: I want to match on commas, but exclude commas that are in between a pair of "double quotes". However, what makes this challenging is that I need to do this with the XML Schema flavor of regex (W3C Specification).
All the solutions I could find for this involved a lookahead, which is not a feature in this flavor of regex. The closest I got was this:
(?:"[^"]*")|(,)
This avoids matching a comma inside quotes by instead matching the quotes as well as any text inside it as a separate group. One suggestion I ran into went like this:
(?:"[^"]*")(*SKIP)(*FAIL)|(,)
This would work perfectly, but again, (*SKIP) and (*FAIL) are not available in this variant of regex.
Here is a sample.
Foo,Bar,"TEST, QUOTES",,Blah
This test string should have 4 matches - each comma, excluding the one in the middle between the quotes. It should match on only the comma, not the text between them.
I'm at a loss, internet. Is this even possible with the limited tools at my disposal? My only other alternative would be much messier and probably slower if I can't get this regex to work.
With the limited ability of XML regex you won't be able to solve this problem; it's the wrong tool. I suggest using an XML parser to manipulate the content as needed instead.
The XML regex flavor is mostly used for validation and Unicode properties... but not for a complex task like yours.
XML Schema regular expressions support the following:
Character classes, including shorthands, ranges and negated classes.
Character class subtraction.
The dot, which matches any character except line breaks.
Alternation and groups.
Greedy quantifiers ?, *, + and {n,m}.
Unicode properties and blocks
That's it.
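If the processing can happen outside XML Schema, the "match the quoted text, capture only the comma" idea from the question can be made to work in a general-purpose language. The following is a minimal Python sketch, not an XSD solution; unquoted_comma_positions is a hypothetical helper name.

```python
import re
from typing import List


def unquoted_comma_positions(line: str) -> List[int]:
    # The first alternative consumes a whole quoted run, so any comma inside
    # it is never seen by the second alternative. Only matches where group 1
    # participated are commas outside quotes.
    return [
        m.start(1)
        for m in re.finditer(r'"[^"]*"|(,)', line)
        if m.group(1)
    ]


sample = 'Foo,Bar,"TEST, QUOTES",,Blah'
print(unquoted_comma_positions(sample))
```

For the sample string this reports four positions, matching the four commas the question expects.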

Notepad 2 insert character after regular expression search

I am having an issue with trying to figure out how to insert some text after I perform a regex search. I know there is a replace function, but I am not looking for that option, just inserting. The text editor I am using is Notepad2, but I am willing to try this in other text editors.
Here is the example that I have.
TEST|Test2|Test3|Test4
This is what I am looking for
Test|Test2|PrefixTest3|Test4
Notice that I am trying to insert the phrase "Prefix" after the 2nd pipe and leave everything else alone.
I can successfully query the result by using this regex:
^[^|]*\|[^|]*|
But then I do not know how I can retain everything prior and after the search point. Any ideas?
You could simply use \K in order to discard the previously matched characters.
^[^|]*\|[^|]*\|\K
Then replace the (empty) match with the string Prefix.
You may easily do that in Notepad2 using the regex-based Replace feature.
Find:       ^\([^|]*|[^|]*|\)
Replace: \1Prefix
Details:
^ - start of a line (Notepad2 never overflows line boundaries!)
\([^|]*|[^|]*|\) - Capturing group 1 matching a sequence of:
[^|]* - zero or more chars other than |
| - a literal pipe symbol (no escaping is necessary: both escaped and unescaped | match a literal |)
[^|]*| - see above, gets to the second |.
The replacement contains a \1 backreference that inserts what was captured with the capturing group 1.
NOTE that Notepad2 regex engine is very limited. Here is what the Notepad2 documentation says:
Notepad2 supports only a limited subset of regular expressions, as provided by built-in engine of the Scintilla source code editing component. The advantage is that it has a very small footprint. There's currently no plans to integrate a more advanced regular expressions engine, but this may be an option for future development.
Note: Regular expression search is limited to single lines, only.
Also, you may refer to the inline comments inside Scintilla RESearch.cxx file describing the supported syntax. Bear in mind that the regex type used in the Notepad2 S&R tool is that of POSIX and not all of the described Scintilla regex features will work in the tool.
Note that Notepad2 does not seem to support alternation and limiting quantifiers (similar to Lua patterns), but \w matches Unicode letters together with ASCII ones. Sadly, I could not make ? quantifier work.
^([^|]*\|[^|]*\|)
Try this. Replace with $1Prefix. See the demo. Just capture the first group and then use it in the replacement. The first group can be accessed with $1.
http://regex101.com/r/pQ9bV3/11
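The capture-and-reinsert approach is not specific to a text editor. As an illustration only, here is how it might look in Python, with insert_after_second_pipe as a made-up helper name; a function is used for the replacement so that the inserted text is never interpreted as a replacement template.

```python
import re


def insert_after_second_pipe(line: str, text: str) -> str:
    # Group 1 captures everything up to and including the second pipe;
    # the replacement puts it back and appends the new text.
    return re.sub(r"^([^|]*\|[^|]*\|)", lambda m: m.group(1) + text, line)


print(insert_after_second_pipe("TEST|Test2|Test3|Test4", "Prefix"))
```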

Negative lookahead alternative

For a URL pattern such as this one:
/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a
I would like Google Analytics to match only the URLs with the a and b parameters in them:
/orderdetail.php?a=BYGhs5w8e9o&b=234844617545
And thus strip out:
&h=9827a
The main goal is to be able to setup a goal in Google Analytics which covers only the a and b parameters and ignores the h parameter.
Is there an easy way to accomplish this without a negative lookahead?
Standard regular expressions do not need negative lookahead for this. Just do a match and replace. Searching for:
(/detail.php\?a=\w+&b=\w+)&h=\w+
and replacing with \1 works with the regular expressions in Notepad++ version 6.5.5. Google's regular expressions may be subtly different.
The above works by wrapping the wanted text in capturing parentheses and leaving the unwanted part outside. The ? needs escaping because, unescaped, it would make the previous item (i.e. the p) optional. The \w sequence means any "word" character, so \w+ matches a run of one or more word characters.
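The same match-and-replace can be tried out in Python as a quick check. This is only a sketch: strip_h_param is a hypothetical name, the dot in detail.php is escaped here as a small tightening of the pattern above, and Google Analytics itself may behave differently.

```python
import re


def strip_h_param(url: str) -> str:
    # Keep the captured part (path plus a and b), drop the trailing &h=... .
    return re.sub(r"(/detail\.php\?a=\w+&b=\w+)&h=\w+", r"\1", url)


print(strip_h_param("/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a"))
```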

Regex matching terminating quote only if quote at the beginning

I want to match the following element with regex
target="#MOBILE"
and all valid variants.
I've written the regex
target[\s\S]*#MOBILE[^>^\s]*
which matches the following
target="#MOBILE"
target = "#MOBILE"
target=#MOBILE
target="#MOBILE" (followed directly by >)
but it doesn't match
target=" #MOBILE "
properly (note the extra space). It only matches
target=" #MOBILE
missing out the final quote
What I need is the terminating expression [^>^\s]* to match a quote only if it matches a quote at the beginning. It also needs to work with single quotes. The terminating expression also needs to end with a whitespace or > char as it does currently.
I'm sure there is a way to do this - but I'm not sure how. It's probably standard stuff - I just don't know it
Incidentally, I'm not sure that [^>^\s]* is the best way to terminate when the regex hits a space or > char, but it's the only way that I can get it to work.
You can use a backreference, similar to jensgram's suggestion:
target\s*=\s*(?:(")\s*)?#Mobile\s*\1
(?:(")\s*)? - Optional non-capturing group that contains a quote (which is captured), and additional optional spaces. If it matched, \1 will contain a quote.
Working example: http://regexr.com?2vkkq
A better alternative for .Net (mainly because you want single quotes, and \1 behaves differently for uncaptured groups):
target\s*=\s*(["']?)\s*?#Mobile\s*\1
Working example: Regex Storm
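The optional-quote backreference can be tried in Python as well. In this sketch the pattern is the second one above with two small changes that are my own assumptions: #MOBILE is written in upper case as in the question, and the lazy \s*? is written as a plain \s*. Because the group (["']?) always participates in the match (possibly capturing an empty string), \1 is always defined. match_target is a hypothetical helper name.

```python
import re
from typing import Optional

# Group 1 captures an opening quote if there is one, else the empty string;
# \1 then requires the same text (quote or nothing) after #MOBILE.
PATTERN = re.compile(r"""target\s*=\s*(["']?)\s*#MOBILE\s*\1""")


def match_target(text: str) -> Optional[str]:
    m = PATTERN.search(text)
    return m.group(0) if m else None


for sample in ['target="#MOBILE"', 'target = "#MOBILE"', "target=#MOBILE",
               'target=" #MOBILE "', "target=' #MOBILE '"]:
    print(repr(match_target(sample)))
```

A mismatched pair such as target="#MOBILE' is not matched at all by this pattern, since the closing quote must equal the opening one.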
Try the following if you need to check that your quotes are in pairs:
target\s*=\s*(['"])(?=\1)\s*#MOBILE\s*(?<=\1)\1
But it really depends if your regex engine supports positive look-(ahead|behind) syntax. And if it supports back-referencing.
Without quotes: target\s*=\s*#MOBILE
With double quotes: target\s*=\s*"\s*#MOBILE\s*"
With single quotes: target\s*=\s*'\s*#MOBILE\s*'
All together
(target\s*=\s*#MOBILE)|(target\s*=\s*"\s*#MOBILE\s*")|(target\s*=\s*'\s*#MOBILE\s*')
Or someone can make it neater.