Is the behavior of XSLT/XQuery regex output implementation-dependent?

Using the regular expression specifications defined for XPath and XQuery, is it possible for two different implementations of fn:analyze-string, given as inputs the same regex and match strings, to return different results and still be considered conforming to the W3C Recommendation? Or should the same inputs always return the same results across different XQuery and XSLT processors?
Specifically, I am asking about the content of match, non-match, group, and #nr values, not the base URIs or node identities (which are clearly defined as implementation dependent).

There are one or two very minor aspects in which the spec is implementation-dependent:
The vendor is allowed to decide which version of Unicode to adopt as the baseline. There are some changes between versions of Unicode, for example changes to character categories, that can affect the outcome of expressions like \p{Cn} or \p{IsGreek}, or the question of whether two characters are considered case-variants of each other.
The rules for captured substrings are not quite precise in edge cases. The spec gives an example: given the regular expression (a*)+ and the input string "aaaa", an implementation might legitimately capture either "aaaa" or a zero length string as the content of the captured subgroup.
Beyond that, the results should be the same across processors. But of course, this is one area where processors might decide that 100% conformance is just too hard. For example, in Saxon-JS we decided to do the best we could using the JavaScript (ECMAScript 6) regex engine, which certainly leaves us short of 100% conformance with the XPath rules.
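As a rough illustration of the captured-substring point, the Python sketch below runs (a*)+ against "aaaa". Python's re module is not an XPath regex engine, so this only shows that one engine has to pick one of the two captures the spec allows. Which one it picks is deliberately not asserted here.

```python
import re

# The overall match is unambiguous; only the content of group 1 is an edge case.
m = re.fullmatch(r"(a*)+", "aaaa")
whole, group = m.group(0), m.group(1)

# Either capture would be a legitimate answer under the wording quoted above.
allowed_captures = ("aaaa", "")
print(repr(whole), repr(group), group in allowed_captures)
```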

One must distinguish between three aspects of the terminology that are crucial:
Nondeterminism, which means that the same function/expression may return different results when evaluated several times with the same parameters/context (with the same implementation, in the same query).
Implementation-dependent behavior, which means that implementations may behave differently for a specific feature (but this does not mean that it cannot be deterministic within the same implementation).
Implementation-defined behavior, which is the same as implementation-dependent behavior, except that the implementation must document its behavior precisely so users can rely on it.
My understanding from the XQuery specification, but also from the XML Schema specification which defines the regular expression language, is that two implementations must return the same results to a call to fn:analyze-string, considerations on the enclosing element nodes left aside.
The XQuery specification says that the nondeterminism of fn:analyze-string is only due, as mentioned in the question, to the fact that the node identity may or may not be the same across repeated and identical calls.
The base URI and prefixes are implementation-dependent, and my understanding is that it is still implicitly meant that they must be chosen deterministically within a query.
Unless I overlooked something, the XML Schema specification does not seem to give any leeway to implementors on regular expressions. XQuery extends XML Schema regular expressions, but the only implementation-dependent feature is the capturing of some groups, which is only relevant for replacements.

Related

What's an "additional tie breaker" for Perl 6 longest token matching?

The docs for Perl 6 longest alternation in regexes punt to Synopsis 5 to document the rules for longest token matching. There are four rules if different alternatives would match a substring of the same length:
The longest declarative prefix breaks the tie
The highest specificity breaks the tie
"If it's still a tie, use additional tie-breakers."
The leftmost alternation finally wins
It's that third rule that I'm curious about.
First the way the text is organized makes clear that the behaviour of the implementation must be deterministic (not random).
Second, and more important, describing the exact behaviour of existing implementations could fill an entire, hard-to-understand page, as every corner case has to be described. In addition, such a specification would limit the implementation's degrees of freedom. Let's assume some implementation supports a "fastest implementation" flag. Such an implementation can use unspecified parts to take shortcuts. So leaving the behaviour unspecified, or specified only to the minimum, has some advantages.

Why is there no definition for std::regex_traits<char32_t> (and thus no std::basic_regex<char32_t>) provided?

I would like to use regular expressions on UTF-32 codepoints and found this reference stating that std::regex_traits has to be defined by the user, so that std::basic_regex can be used at all. There seem to be no changes planned for the future.
Why is this even the case?
Does this have to do with the fact that Unicode says combined codepoints have to be treated as equal to the single-codepoint representation (like the umlaut 'ä' represented as a single codepoint or with the a and the dots as two separate ones)?
Given the simplification that only single-codepoint characters would be supported, could this trait be defined easily or would this be either non-trivial nevertheless or require further limitations?
Some aspects of regex matching are locale-aware, with the result that a std::regex_traits object includes or references an instance of a std::locale object. The C++ standard library only provides locales for char and wchar_t characters, so there is no standard locale for char32_t (unless it happens to be the same as wchar_t), and this restriction carries over into regexes.
Your description is imprecise. Unicode defines a canonical equivalence relationship between two strings, which is based on normalizing the two strings, using either NFC or NFD, and then comparing the normalized values codepoint by codepoint. It does not define canonical equivalence simply as an equivalence between a codepoint and a codepoint sequence, because normalization cannot simply be done character-by-character. Normalisation may require reordering composing characters into the canonical order (after canonical (de)composition). As such, it does not easily fit into the C++ model of locale transformations, which are generally single-character.
The C++ standard library does not implement any Unicode normalization algorithm; in C++, as in many other languages, the two strings L"\u00e4" (ä) and L"\u0061\u0308" (ä) will compare as different, although they are canonically equivalent, and look to the human reader like the same grapheme. (On the machine on which I'm writing this answer, the rendering of those two graphemes is subtly different; if you look closely, you'll see that the umlaut in the second one is slightly displaced from its visually optimal position. That violates the Unicode requirement that canonically equivalent strings have precisely the same rendering.)
If you want to check for canonical equivalence of two strings, you need to use a Unicode normalisation library. Unfortunately, the C++ standard library does not include any such API; you could look at ICU (which also includes Unicode-aware regex matching).
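A minimal sketch of such a check, written in Python for brevity, with the standard unicodedata module standing in for a normalisation library such as ICU. The helper name canonically_equivalent is made up for this example.

```python
import unicodedata

composed = "\u00e4"      # ä as a single code point
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS

def canonically_equivalent(s, t):
    """Compare two strings code point by code point after NFD normalisation."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

print(composed == decomposed)                        # plain comparison
print(canonically_equivalent(composed, decomposed))  # comparison after normalising
```

A plain comparison sees two different code point sequences; only the normalised comparison treats them as the same text.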
In any case, regular expression matching -- to the extent that it is specified in the C++ standard -- does not normalize the target string. This is permitted by the Unicode Technical Report on regular expressions, which recommends that the target string be explicitly normalized to some normalization form and the pattern written to work with strings normalized to that form:
For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters.… In practice, regex APIs are not set up to match parts of characters or handle discontiguous selections. There are many other edge cases… It is feasible, however, to construct patterns that will match against NFD (or NFKD) text. That can be done by:
Putting the text to be matched into a defined normalization form (NFD or NFKD).
Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
Applying the matching algorithm on a code point by code point basis, as usual.
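The three steps quoted above can be sketched as follows. This is an illustration in Python rather than C++, using the standard re and unicodedata modules; nfd_search is a hypothetical helper, and the pattern is assumed to have been written by hand in NFD.

```python
import re
import unicodedata

def nfd_search(pattern_nfd, text):
    """Step 1: normalise the text to NFD. Step 3: match code point by code point."""
    return re.search(pattern_nfd, unicodedata.normalize("NFD", text))

# Step 2: the pattern itself only contains sequences that occur in NFD text.
pattern = "a\u0308"

precomposed_text = "M\u00e4dchen"   # contains ä as one code point
print(re.search(pattern, precomposed_text) is not None)   # without normalising
print(nfd_search(pattern, precomposed_text) is not None)  # with normalising
```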
The bulk of the work in creating a char32_t specialization of std::regex_traits would be creating a char32_t locale object. I've never tried doing either of these things; I suspect it would require a fair amount of attention to detail, because there are a lot of odd corner cases.
The C++ standard is somewhat vague about the details of regular expression matching, leaving the details to external documentation about each flavour of regular expression (and without a full explanation about how to apply such external specifications to character types other than the one each flavour is specified on). However, the fact that matching is character-by-character is possible to deduce. For example, in § 28.3, Requirements [re.req], Table 136 includes the locale method responsible for the character-by-character equivalence algorithm:
Expression: v.translate(c)
Return type: X::char_type
Assertion: Returns a character such that for any character d that is to be considered equivalent to c then v.translate(c) == v.translate(d).
Similarly, in the description of regular expression matching for the default "Modified ECMAScript" flavour (§ 28.13), the standard describes how the regular expression engine matches two characters, one in the pattern and one in the target (paragraph 14.1):
During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:
if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);
otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);
otherwise, the two characters are equal if c == d.
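The comparison rules quoted above can be mirrored in a few lines. The sketch below is in Python, not C++, and SimpleTraits is a hypothetical stand-in rather than the real std::regex_traits interface; it only shows that every rule compares exactly one character with one character.

```python
class SimpleTraits:
    """Hypothetical traits object; both methods map one character to one character."""
    def translate(self, c):
        return c

    def translate_nocase(self, c):
        return c.lower()

def chars_equal(c, d, traits, icase=False, collate=False):
    """Apply the three rules in order: icase, then collate, then plain equality."""
    if icase:
        return traits.translate_nocase(c) == traits.translate_nocase(d)
    if collate:
        return traits.translate(c) == traits.translate(d)
    return c == d

traits = SimpleTraits()
print(chars_equal("A", "a", traits, icase=True))
print(chars_equal("\u00e4", "a", traits, collate=True))
```

Because each call sees a single character on each side, there is no place where a two-code-point sequence such as a + U+0308 could be compared with the single code point U+00E4.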
I've just discovered a regex implementation which supports char32_t: http://www.akenotsuki.com/misc/srell/en/
It mimics the std::regex API and is under a BSD license.

Is there an algorithm for determining if the set of all valid XML instances in respect with a specific XSD schema is a regular language or not?

Essentially I want to know if a specific XSD schema can be replaced by a regular expression or not. I know that XML Schema language can produce XSDs whose set of valid XML instances can be of any type of language (even context-sensitive). I want to identify those schemas that are "regex-equivalent". I came up with this question after tackling the following problem:
I needed to parse a specific text format. I first tried regular expressions and saw that a regex is sufficient to parse it. I then wanted to make an XML representation for the messages that I received in this format, so I mapped regex groups to XML elements. I then manually created an XSD schema based on the structure of the regex. In the end, I had a schema that could replace my regex, in the sense that the original regex could be constructed from the schema. I also managed to do the opposite: create the schema automatically from the regex. So I could transform the message into XML and validate it at the same time. My questions are:
Can every regex be represented by an XSD schema? (I mean, given a regex to be able to produce an XSD schema)
Given an arbitrary XSD schema is there a way to determine if there is a regex whose representation is the given schema?
EDIT: Probably the answer to the 1st question is yes, since I did it with my regex in a way that did not depend on the specific regex (this isn't a proof for every regex).
XML Schema language is a super-set of regular languages, but only within the domain of XML documents, obviously.
For #1: with the added condition that the regex matches a well-formed XML document and nothing else, yes.
For #2: yes, it's a matter of checking for any features of XSD which are not allowed in a regular language. Finding the regular expression would be much more work.
A regular language has a fairly simple definition, informally:
The empty set/string
Literals (a "singleton language"), e.g., "x"
For a regular language A, A* is also a regular language
For regular languages A and B, A|B (union) and AB (concatenated) are regular.
Basically, all concatenations and alternations are fine, but recursion is impossible and there are no back-references or "memory". No element type can contain choice/all/element elements referencing itself or parent types, and you can't make use of any information you found earlier in the parsing process.
The restriction on recursion extends to the any element, which would be forbidden. By definition it accepts any element, including elements with sub-elements. Since you don't know the nesting depth of this unknown element you need a recursive pattern to match it, and you can't do that in a regular language.
The restriction on back-references means you can't do things like "some number of 'A' followed by the same number of 'B'" (A{n}B{n}). I don't think this is even possible in XSD, however, at least I can't think how you would do it.
Restricting numeric values (e.g., minInclusive) would not be possible in a regex.
The all element would be problematic in that it would have to accept all the possible orderings of child elements, which would make the regex expand exponentially (binomial coefficient, (n/k)^k <= n!/k!(n-k)! <= (ne/k)^k) with the number of child elements, and matching the regex is super-linear on that length. Recognizing attributes suffers from the same issue, since the ordering of attributes within an element is not constrained by the schema. Of course, if you only care about whether a regex exists and not about finding it, then it doesn't matter.
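To make the blow-up concrete, here is a small Python sketch that spells out an all group as one alternation branch per ordering. It assumes every child is required exactly once and represents each child as an empty element tag; all_group_regex is a made-up helper, not part of any XSD tooling.

```python
import re
from itertools import permutations
from math import factorial

def all_group_regex(names):
    """Regex accepting the named child elements in any order, each exactly once."""
    return "|".join("".join(f"<{n}/>" for n in order) for order in permutations(names))

pattern = all_group_regex(["a", "b", "c"])
branches = pattern.count("|") + 1
print(branches, factorial(3))
print(re.fullmatch(pattern, "<b/><c/><a/>") is not None)
```

Under these assumptions the number of branches is n! for n children, so the pattern text becomes unmanageable long before the schema does.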

DFA based regular expression matching - how to get all matches?

I have a given DFA that represent a regular expression.
I want to match the DFA against an input stream and get all possible matches back, not only the leftmost-longest match.
For example:
regex: a*ba|baa
input: aaaaabaaababbabbbaa
result:
aaaaaba
aaba
ba
baa
Assumptions
Based on your question and later comments you want a general method for splitting a sentence into non-overlapping, matching substrings, with non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. Also I assume you have an existing algorithm to transform a regular expression into DFA form already. I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to DFA, since I'm not aware of any other way of accomplishing this.
Before you go chasing after shadows, make sure you're trying to apply the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be coded into a regular expression itself, then you can't (fully) benefit from the advantages of regular expression algorithms.
An obvious example is that no amount of cleverness will allow a FSM, or any algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... where the ellipsis is the portion of the expression not yet scanned because the user has not typed them yet, cannot give you every possible right subgroup.
For a more detailed discussion of Regular expression implementations, and specifically Thompson NFA's please check this link, which describes a simple C implementation with some clever optimizations.
Limitations of Regular Languages
The O(n) time and O(1) space guarantees of regular expression algorithms are fairly narrow claims. Specifically, the regular languages are exactly the languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than regular. On top of that, if you can show that some enhancement requires greater than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are very careful to keep our algorithm within these narrow constraints.
Obviously that eliminates anything we might want to do with recursive backtracking. A stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be. A long enough sentence would overflow any integer pointer. We can't create new states for the automaton as we go to get around this. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary for the specific language we want to match, but by no other variable.
This still allows some room for adding additional behavior. The usual way of getting more mileage is to add some extra annotations for where certain events in processing occur, such as when a subexpression started or stopped matching. Since we are only allowed to have constant space processing, that limits the number of subexpression matches we can process. This usually means the latest instance of that subexpression. This is why, when you ask for the subgroup matched by (a|)*, you always get an empty string, because any sequence of a's is implicitly followed by infinitely many empty strings.
The other common enhancement is to do some clever thing between states. For example, in perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or vice versa. Many simple assertions fit this, including the common line anchor operators, ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.
When discussing the differences between various regular language recognizers, it's worth clarifying if we're talking about match recognition or search recognition, the former being an accept only if the entire sentence is in the language, and the latter accepts if any substring in the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted in the match method.
This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does: either token will match the right side of the disjunction. In the match method it does not, because you could never get that sentence by stepping through the expression and generating tokens from the transitions, at least in a single pass. For this reason, I'm only going to talk about match semantics. Obviously, if you want search semantics, you can modify the expression with .*'s.
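The two notions can be compared directly in Python, where re.fullmatch plays the role of the match method and re.search the role of the search method. The helper names below are made up for the example, and Python's backtracking engine is used only as a convenient oracle for acceptance.

```python
import re

def match_accepts(expr, s):
    """Match semantics: the entire sentence must be in the language."""
    return re.fullmatch(expr, s) is not None

def search_accepts(expr, s):
    """Search semantics: some substring must be in the language."""
    return re.search(expr, s) is not None

def search_via_match(expr, s):
    """Search expressed through match by wrapping the expression in .* on both sides."""
    return re.fullmatch(".*(?:%s).*" % expr, s, flags=re.DOTALL) is not None

print(search_accepts("a*b|a", "aa"), match_accepts("a*b|a", "aa"))
print(search_via_match("a*b|a", "aa"))
```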
Note: A language defined by expression E|.* is not really a very manageable language, regardless of the sublanguage of E because it matches all possible sentences. This represents a real challenge for regular expression recognizers because they are really only suited to recognizing a language or else confirming that a sentence is not in that same language, rather than doing any more specific work.
Implementation of Regular Language Recognizers
There are generally three ways to process a regular expression. All three start the same, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extremely simple. Here's some crude ascii art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expression. Epsilon (ε) is a state with inputs and outputs, but ignores the stream of characters, and doesn't consume any input either.
a     ::= > a ->
E1 E2 ::= >- E1 ->- E2 ->
               /---->
E1*   ::= > --ε <-\
               \  /
                E1
             /-E1 ->
E1|E2 ::= > ε
             \-E2 ->
And that's it! Common uses such as E+, E?, [abc] are equivalent to EE*, (E|), (a|b|c) respectively. Also note that we add for each production rule a very small number of new states. In fact each rule adds zero or one state (in this presentation). Characters, quantifiers and disjunction all add just one state, and the concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to start pointers of other states or fragments.
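The equivalences for E+, E? and [abc] can be written as a small rewriting pass. The sketch below is in Python and assumes a made-up tuple representation of the expression tree (tags such as "lit", "cat", "alt", "star", "plus", "opt", "class" and "empty" are inventions for this example); it reduces every expression to the four core constructs drawn above plus the empty expression.

```python
def desugar(node):
    """Rewrite plus/opt/class nodes into cat/alt/star/lit/empty nodes."""
    kind = node[0]
    if kind in ("lit", "empty"):
        return node
    if kind == "plus":                       # E+ is E E*
        e = desugar(node[1])
        return ("cat", e, ("star", e))
    if kind == "opt":                        # E? is (E|)
        return ("alt", desugar(node[1]), ("empty",))
    if kind == "class":                      # [abc] is (a|b|c); assumes a non-empty class
        chars = node[1]
        result = ("lit", chars[-1])
        for c in reversed(chars[:-1]):
            result = ("alt", ("lit", c), result)
        return result
    if kind == "star":
        return ("star", desugar(node[1]))
    # Remaining binary nodes ("cat", "alt"): desugar both operands.
    return (kind, desugar(node[1]), desugar(node[2]))

print(desugar(("plus", ("class", "ab"))))
```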
The epsilon transition states are important, because they are ambiguous. When encountered, is the machine supposed to change state to one following state or another? Should it change state at all or stay put? That's the reason why these automatons are called nondeterministic. The solution is to have the automaton transition to the right state, whichever allows it to match the best. Thus the tricky part is to figure out how to do that.
There are fundamentally two ways of doing this. The first way is to try each one. Follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, which appears in a few (notable) implementations. For well crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, recursive backtracking is very, very bad, O(2^n).
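A toy backtracking matcher makes the exponential case visible. The Python sketch below is not how any particular engine is implemented; it works on a made-up tuple representation of the expression and counts how many times the matcher is entered for the classic pathological pattern of n optional a's followed by n required a's, run against a string of n a's.

```python
def backtrack(node, s, i, counter):
    """Yield every end position at which node can match s starting at i."""
    counter[0] += 1
    kind = node[0]
    if kind == "empty":
        yield i
    elif kind == "lit":
        if i < len(s) and s[i] == node[1]:
            yield i + 1
    elif kind == "cat":
        for j in backtrack(node[1], s, i, counter):
            yield from backtrack(node[2], s, j, counter)
    elif kind == "alt":                       # try the first choice, then the next
        yield from backtrack(node[1], s, i, counter)
        yield from backtrack(node[2], s, i, counter)

def pathological(n):
    """Tree for (a|)(a|)...(a|)aa...a with n optional and n required a's."""
    opt_a = ("alt", ("lit", "a"), ("empty",))
    node = ("empty",)
    # Wrapping from the right: the last part added ends up leftmost.
    for part in [("lit", "a")] * n + [opt_a] * n:
        node = ("cat", part, node)
    return node

def count_steps(n):
    """Return (accepted, number of matcher calls) for a string of n a's."""
    counter = [0]
    s = "a" * n
    accepted = any(j == len(s) for j in backtrack(pathological(n), s, 0, counter))
    return accepted, counter[0]

for n in (4, 8, 12):
    print(n, count_steps(n))
```

The only successful path is the one where every optional a matches nothing, and that is the last combination the matcher tries, so all 2^n combinations are explored first.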
The other way of doing this is to instead try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, you can have the same state come up more than once, but you only need to track it once: either you are in that state or not. If you get to the point that there's no option for a particular state to follow, just ignore it; that path didn't match. If there are no more states, then the entire expression didn't match. As soon as any state reaches the final state, you are done.
Just from that explanation, the amount of work we have to do has gone up a little bit. We've gone from having to keep track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. Also the amount of storage we needed has gone up, since now it's no longer a single pointer to one possible state in the NFA, but a whole set of them.
However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that too is m.
This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.
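The parallel approach can be sketched in Python as below. It builds fragments in the spirit of the construction rules described earlier and then advances a set of states one input character at a time. All names (State, Frag, lit, cat, alt, star, compile_nfa, matches) are invented for this example, and the simulation uses match semantics on the whole input rather than stopping at the first accepting state.

```python
class State:
    """NFA state: char is a literal, or None for an epsilon/split/accept state."""
    def __init__(self, char=None):
        self.char = char
        self.out = []                # successor states

class Frag:
    """Partial NFA: a start state plus the states whose exit is still dangling."""
    def __init__(self, start, dangling):
        self.start, self.dangling = start, dangling

def patch(frag, target):
    for s in frag.dangling:
        s.out.append(target)

def lit(c):
    s = State(c)
    return Frag(s, [s])

def cat(a, b):                       # adds no state, only connects fragments
    patch(a, b.start)
    return Frag(a.start, b.dangling)

def alt(a, b):                       # one new epsilon state with two exits
    split = State()
    split.out = [a.start, b.start]
    return Frag(split, a.dangling + b.dangling)

def star(a):                         # one new epsilon state: loop or leave
    split = State()
    split.out.append(a.start)
    patch(a, split)
    return Frag(split, [split])

def compile_nfa(frag):
    accept = State()
    patch(frag, accept)
    return frag.start, accept

def closure(states):
    """Follow epsilon states; the result is a set, so duplicates collapse."""
    seen, stack = set(), list(states)
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            if s.char is None:
                stack.extend(s.out)
    return seen

def matches(nfa, text):
    start, accept = nfa
    current = closure([start])
    for ch in text:
        current = closure(t for s in current if s.char == ch for t in s.out)
        if not current:              # no live state left: reject early
            return False
    return accept in current

nfa = compile_nfa(alt(cat(star(lit("a")), lit("b")), lit("a")))   # a*b|a
print(matches(nfa, "aaab"), matches(nfa, "aa"))
```

The set current never holds more entries than there are states in the NFA, which is the per-character bound discussed above.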
This NFA, simultaneously in some subset of its m states, may be considered some other state machine whose state represents the set of states the NFA it models could be in. Each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.
Using this alternate representation has an advantage that instead of updating m state pointers, you only have to update one. It also has a downside: since it models the powerset of m states, it actually has up to 2^m states. That is an upper limit, because you don't model states that cannot happen. For instance, the expression a|b has two possible states after reading the first character, either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFAs actually get SMALLER than the NFA they represent, but there is simply no way to guarantee that.
To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
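A minimal sketch of that idea in Python: DFA states are frozensets of NFA states, transitions are computed only when first needed, and the cache is thrown away when it grows past a limit. The NFA below is a hand-written, epsilon-free automaton for (a|b)*abb, and make_lazy_dfa, DELTA and max_cache are names invented for the example. Clearing the whole cache is a cruder policy than discarding the least recently used entries, chosen here only to keep the code short.

```python
def make_lazy_dfa(nfa_delta, start, accepting, max_cache=1000):
    """Return (accepts, cache); cache maps (state_set, char) to the next state_set."""
    cache = {}

    def step(state_set, ch):
        key = (state_set, ch)
        if key not in cache:
            if len(cache) >= max_cache:
                cache.clear()        # crude eviction: entries can always be rebuilt
            cache[key] = frozenset(
                t for s in state_set for t in nfa_delta.get((s, ch), ())
            )
        return cache[key]

    def accepts(text):
        current = frozenset([start])
        for ch in text:
            current = step(current, ch)
            if not current:
                return False
        return bool(current & accepting)

    return accepts, cache

# Epsilon-free NFA for (a|b)*abb with states 0..3, where 3 is accepting.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
accepts, cache = make_lazy_dfa(DELTA, 0, frozenset({3}))
print(accepts("babb"), accepts("abba"), len(cache))
```

Only the DFA transitions that the inputs actually exercise end up in the cache, so the full power set is never built.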
From Theory to Practice
Many practical uses of regular expressions involve tracking the position of the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64 bit pointer, the input could possibly be 2^64+1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, and now your algorithm requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably won't notice that it would fail because you'd terminate it long before then.
Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing braces in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two state pointers with disagreeing annotations collapse to the same state, the annotation of the later input position wins again.
If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively trying less, or trying as little as it can and recursively trying more, when it fails its first attempt. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/nongreedy hinting. This information would be used to determine if newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.
Another kind of practical enhancement is assertions, productions which do not consume input, but match or reject based on some aspect of the input position. For instance in perl regex, a \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character, but the next character must not be, or vice versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues, otherwise it is discarded.
Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes with the state pointer flowing into r2, but otherwise, it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
Lookahead also works by using an extra regex fragment for the assertion, but is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check to see that the state pointer it depends upon is still matching. If it doesn't find one, then it fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpression's state pointers than you do for the main part: if two state pointers in the assertion land on the same state, then the state pointers each of them depend upon will share the same fate, and can be reannotated to point to the single pointer you keep.
While we're adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction to make the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it will wrap up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input, but if the user types something invalid, the simulator is empty, and can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every submatch in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and no submatch for the corresponding b's.
Finally, there might be a way to modify this to work with backreferences. Even if it's elegant, it's sure to be inefficient: matching regular expressions with backreferences is NP-complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.

Regular expressions Lexical Analysis

Why repeated strings such as
[wcw|w is a string of a's and b's]
cannot be denoted by regular expressions?
Please give me a detailed answer, as I am new to lexical analysis.
Thanks ...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures as those languages can be described by a simple finite state machine. Simplified you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is that no information whatsoever from previous states can be carried over to later states (a few characters further in the input). So if you have your symbol w you can't specify that the input must have exactly the same string w later in the sequence. Similarly you can't ensure that each opening parenthesis has a closing parenthesis as well (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternative (|) and repetition (*), everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow to use such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly such use of backreferences can even yield languages that are not context-free.
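For instance, Python's re module supports backreferences, so the language wcw from the question can be recognised with a pattern that goes beyond regular expressions in the formal sense. The names WCW and is_wcw are made up for this sketch.

```python
import re

# ([ab]*) captures w; \1 then demands the very same string again after the c.
WCW = re.compile(r"([ab]*)c\1")

def is_wcw(s):
    return WCW.fullmatch(s) is not None

print(is_wcw("abcab"), is_wcw("abcba"))
```

The backreference \1 is exactly the "memory" that a finite state machine lacks: the engine has to remember an arbitrarily long w in order to compare it with the second half.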
Additional pointers:
This StackOverflow question
Wikipedia
It can be, you just can't assure that it's the same string of "a"s and "b"s because there's no way to retain the information acquired in traversing the first half for use in traversing the second.