LEX Pattern for matching compressed textual representation of an IP version 6 address

I am aware that there are lots of posts on Stack Overflow and elsewhere giving regular expressions, including LEX patterns, for IPv6 addresses. None of them appears to be truly complete, and indeed some requirements do not call for parsing every possible address format.
I am looking for a LEX pattern for IPv6 addresses only in their compressed textual form. This form is described in Section 2.2 of RFC 5952 (and possibly other related RFCs) and represents a relatively small subset of all possible IPv6 address formats.
If anyone has a well-tested pattern, or is aware of one, please forward it.

RFC 5952 §2.2 does not formally describe the compressed IPv6 address form. The goal of RFC 5952 is to produce a "canonical textual representation form"; that is, a set of textual encodings which has a one-to-one relationship with the set of IPv6 addresses. Section 2.2 enumerates a few aspects of the compressed form which lead to encoding options; a canonical representation needs to eliminate all options.
The compressed syntax is actually described in clause 2 of RFC 4291 §2.2. That syntax is easy enough to describe as a regular expression, although it's a little annoying; it would be easier in a syntax which includes the intersection of two regular expressions (Ragel provides that operator, for example), but in this case a simple enumeration of possibilities suffices.
If you really want to limit the matches to the canonical representations listed in RFC 5952 §4.2, then you have a slightly more daunting task because of the requirement that the compressed run of 0s must be the longest run of 0s in the uncompressed address, or the first such run if there is more than one longest run of the same length.
That would be possible by making a much longer enumeration of permissible forms where the compressed run satisfies the "first longest" constraint. But I'm really not sure that there is any value in creating that monster, since RFC 5952 is quite clear that the intent is to restrict the set of representations produced by a conforming application (emphasis added):
…[A]ll implementations MUST accept and be able to handle any legitimate RFC4291 format.
Since regular expressions are mostly of use in recognising and parsing inputs, it seems unnecessary to go to the trouble of writing and verifying the list of possible canonical patterns.
An IPv6 address conforming to clause 1 of RFC 4291 §2.2 can easily be described in lex syntax:
piece [[:xdigit:]]{1,4}
%%
{piece}(:{piece}){7} { /* an uncompressed IPv6 address */ }
In passing, although it seems unnecessary for the same reasons noted above, it's very simple to restrict {piece} to the canonical 16-bit representations (lower-case only, no leading zeros):
piece 0|[1-9a-f][0-9a-f]{0,3}
The complication comes with the requirement in clause 2 that only one run of 0s be compressed. It's easy to write a regular expression which allows only one number to be omitted:
(({piece}:)*{piece})?::({piece}(:{piece})*)?
but that formulation no longer limits the number of pieces to 8. It's also fairly easy to write a regular expression which allows omitted pieces, limiting the number of fields:
{piece}(:{piece}?){1,6}:{piece}|:(:{piece}){1,7}|({piece}:){1,7}:|::
What's desired is the intersection of those two patterns, plus the pattern for uncompressed addresses. But, as mentioned, there's no way of writing intersections in (f)lex. So we end up enumerating possibilities. A simple enumeration is the number of initial uncompressed pieces:
(?x: /* Flex's extended syntax allows whitespace and continuation lines */
{piece}(:{piece}){7}
| {piece} ::{piece}(:{piece}){0,5}
| {piece}:{piece} ::{piece}(:{piece}){0,4}
| {piece}(:{piece}){2}::{piece}(:{piece}){0,3}
| {piece}(:{piece}){3}::{piece}(:{piece}){0,2}
| {piece}(:{piece}){4}::{piece}(:{piece})?
| {piece}(:{piece}){5}::{piece}
| {piece}(:{piece}){0,6}::
| ::{piece}(:{piece}){0,6}
| ::
)
That still excludes the various forms of embedding IPv4 addresses in IPv6, but it should be clear how to add those, if desired.
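If you want to exercise the enumeration outside of (f)lex, it translates almost verbatim into an ordinary regex library. The following is a quick test-harness sketch in Python (using the plain RFC 4291 piece, not the canonical RFC 5952 one); it is only an illustration of how the alternatives behave, not a replacement for the flex rule:

import re

# Hexadecimal piece from clause 1 of RFC 4291 §2.2: 1 to 4 hex digits.
P = r'[0-9A-Fa-f]{1,4}'

# Same enumeration as the flex pattern above: one alternative per number
# of uncompressed pieces appearing before the "::".
alternatives = [
    rf'{P}(?::{P}){{7}}',                      # uncompressed, exactly 8 pieces
    rf'{P}::{P}(?::{P}){{0,5}}',
    rf'{P}:{P}::{P}(?::{P}){{0,4}}',
    rf'{P}(?::{P}){{2}}::{P}(?::{P}){{0,3}}',
    rf'{P}(?::{P}){{3}}::{P}(?::{P}){{0,2}}',
    rf'{P}(?::{P}){{4}}::{P}(?::{P})?',
    rf'{P}(?::{P}){{5}}::{P}',
    rf'{P}(?::{P}){{0,6}}::',
    rf'::{P}(?::{P}){{0,6}}',
    r'::',
]
IPV6 = re.compile('|'.join(alternatives))

tests = {
    '1:2:3:4:5:6:7:8': True,      # fully uncompressed
    '2001:db8::1': True,          # one compressed run of zeros
    '::1': True,                  # loopback
    '::': True,                   # unspecified address
    '1::2::3': False,             # two compressed runs: not allowed
    '1:2:3:4:5:6:7:8:9': False,   # too many pieces
}
for addr, expected in tests.items():
    assert bool(IPV6.fullmatch(addr)) == expected, addr
print('all test addresses behave as expected')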

Is the behavior of XSLT/XQuery regex output implementation-dependent?

Using the regular expression specifications defined for XPath and XQuery, is it possible for two different implementations of fn:analyze-string, given as inputs the same regex and match strings, to return different results and still be considered conforming to the W3C Recommendation? Or should the same inputs always return the same results across different XQuery and XSLT processors?
Specifically, I am asking about the content of the match, non-match, and group elements and the @nr values, not the base URIs or node identities (which are clearly defined as implementation-dependent).
There are one or two very minor aspects in which the spec is implementation-dependent:
The vendor is allowed to decide which version of Unicode to adopt as the baseline. There are some changes between versions of Unicode, for example changes to character categories, that can affect the outcome of expressions like \p{Cn} or \p{IsGreek}, or the question of whether two characters are considered case-variants of each other.
The rules for captured substrings are not quite precise in edge cases. The spec gives an example: given the regular expression (a*)+ and the input string "aaaa", an implementation might legitimately capture either "aaaa" or a zero-length string as the content of the captured subgroup.
Beyond that, the results should be the same across processors. But of course, this is one area where processors might decide that 100% conformance is just too hard; for example, in Saxon-JS we decided to do the best we could using the JavaScript (ES6) regex engine, which certainly leaves us short of 100% conformance with the XPath rules.
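To make that captured-substring edge case concrete, here is what one particular backtracking engine reports for (a*)+ against "aaaa" (Python, used purely as an illustration; it is not one of the XQuery/XSLT processors under discussion). Per the wording quoted above, either "aaaa" or the empty string would be a conforming capture, and other engines may legitimately differ:

import re

# The content of group 1 is the under-specified part: a conforming engine
# may report "aaaa" or "" here, so we just print whatever this one does.
m = re.fullmatch(r'(a*)+', 'aaaa')
print(repr(m.group(1)))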
One must distinguish between three crucial pieces of terminology:
Nondeterminism, which means that the same function/expression may return different results when evaluated several times with the same parameters/context (with the same implementation, in the same query).
Implementation-dependent behavior, which means that implementations may behave differently for a specific feature (but this does not mean that it cannot be deterministic within the same implementation).
Implementation-defined behavior, which is the same as implementation-dependent behavior, except that the implementation must document its behavior precisely so users can rely on it.
My understanding from the XQuery specification, but also from the XML Schema specification which defines the regular expression language, is that two implementations must return the same results for a call to fn:analyze-string, leaving aside considerations about the enclosing element nodes.
The XQuery specification says that the nondeterminism of fn:analyze-string is only due, as mentioned in the question, to the fact that the node identity may or may not be the same across repeated and identical calls.
The base URI and prefixes are implementation-dependent, and my understanding is that it is still implicitly meant that they must be chosen deterministically within a query.
Unless I overlooked something, the XML Schema specification does not seem to give any leeway to implementors on regular expressions. XQuery extends XML Schema regular expressions, but the only implementation-dependent feature is the capturing of some groups, which is only relevant for replacements.

Why is there no definition for std::regex_traits<char32_t> (and thus no std::basic_regex<char32_t>) provided?

I would like to use regular expressions on UTF-32 code points and found this reference stating that std::regex_traits has to be defined by the user so that std::basic_regex can be used at all. There seem to be no changes planned for this in the future.
Why is this even the case?
Does this have to do with the fact that Unicode says a combining sequence of code points has to be treated as equal to the single-code-point representation (like the umlaut 'ä' being represented either as a single code point or as the 'a' and the dots as two separate code points)?
Given the simplification that only single-code-point characters would be supported, could this trait be defined easily, or would it still be non-trivial or require further limitations?
Some aspects of regex matching are locale-aware, with the result that a std::regex_traits object includes or references an instance of a std::locale object. The C++ standard library only provides locales for char and wchar_t characters, so there is no standard locale for char32_t (unless it happens to be the same as wchar_t), and this restriction carries over into regexes.
Your description is imprecise. Unicode defines a canonical equivalence relationship between two strings, which is based on normalizing the two strings, using either NFC or NFD, and then comparing the normalized values code point by code point. It does not define canonical equivalence simply as an equivalence between a code point and a code-point sequence, because normalization cannot simply be done character by character. Normalisation may require reordering combining characters into the canonical order (after canonical (de)composition). As such, it does not easily fit into the C++ model of locale transformations, which are generally single-character.
The C++ standard library does not implement any Unicode normalization algorithm; in C++, as in many other languages, the two strings L"\u00e4" (ä) and L"\u0061\u0308" (ä) will compare as different, although they are canonically equivalent and look to the human reader like the same grapheme. (On the machine I'm writing this answer on, the rendering of those two graphemes is subtly different; if you look closely, you'll see that the umlaut in the second one is slightly displaced from its visually optimal position. That violates the Unicode requirement that canonically equivalent strings have precisely the same rendering.)
If you want to check for canonical equivalence of two strings, you need to use a Unicode normalisation library. Unfortunately, the C++ standard library does not include any such API; you could look at ICU (which also includes Unicode-aware regex matching).
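For illustration only (Python's unicodedata rather than a C++ library, since the standard C++ library offers nothing comparable), checking canonical equivalence amounts to normalizing both strings consistently and then comparing them code point by code point:

import unicodedata

s1 = '\u00e4'      # 'ä' as a single precomposed code point
s2 = 'a\u0308'     # 'a' followed by U+0308 COMBINING DIAERESIS

# A raw comparison sees two different code point sequences.
print(s1 == s2)                                   # False

# Canonical equivalence: normalize (NFC here; NFD works too) and compare.
print(unicodedata.normalize('NFC', s1) ==
      unicodedata.normalize('NFC', s2))           # True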
In any case, regular expression matching -- to the extent that it is specified in the C++ standard -- does not normalize the target string. This is permitted by the Unicode Technical Report on regular expressions, which recommends that the target string be explicitly normalized to some normalization form and the pattern written to work with strings normalized to that form:
For most full-featured regular expression engines, it is quite difficult to match under canonical equivalence, which may involve reordering, splitting, or merging of characters.… In practice, regex APIs are not set up to match parts of characters or handle discontiguous selections. There are many other edge cases… It is feasible, however, to construct patterns that will match against NFD (or NFKD) text. That can be done by:
Putting the text to be matched into a defined normalization form (NFD or NFKD).
Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
Applying the matching algorithm on a code point by code point basis, as usual.
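A minimal sketch of that recipe (again Python for illustration): the pattern is written in NFD, and the target is normalized to NFD before ordinary code-point-by-code-point matching is applied.

import re
import unicodedata

# Pattern written in NFD: base letter followed by the combining diaeresis.
pattern = re.compile('a\u0308')

# The target may arrive in either form; normalize it to NFD before matching.
for text in ('\u00e4', 'a\u0308'):
    nfd = unicodedata.normalize('NFD', text)
    print(bool(pattern.search(nfd)))    # True in both cases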
The bulk of the work in creating a char32_t specialization of std::regex_traits would be creating a char32_t locale object. I've never tried doing either of these things; I suspect it would require a fair amount of attention to detail, because there are a lot of odd corner cases.
The C++ standard is somewhat vague about the details of regular expression matching, leaving them to external documentation about each flavour of regular expression (and without a full explanation of how to apply such external specifications to character types other than the one each flavour is specified for). However, it is possible to deduce that matching is character-by-character. For example, in § 28.3, Requirements [re.req], Table 136 includes the locale method responsible for the character-by-character equivalence algorithm:
Expression: v.translate(c)
Return type: X::char_type
Assertion: Returns a character such that for any character d that is to be considered equivalent to c then v.translate(c) == v.translate(d).
Similarly, in the description of regular expression matching for the default "Modified ECMAScript" flavour (§ 28.13), the standard describes how the regular expression engine matches two characters (one in the pattern and one in the target), in paragraph 14.1:
During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:
if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);
otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);
otherwise, the two characters are equal if c == d.
I've just discovered a regex implementation which supports char32_t: http://www.akenotsuki.com/misc/srell/en/
It mimics std::regex API and is under BSD license.

Is there a dataset available to fully test a base64 encode/decoder?

I see that there are many base64 implementations available as open source, and I found multiple internal implementations in a product that I am maintaining.
I'm trying to factor out the duplicates, but I am not 100% certain that all these implementations give identical output. Therefore I need a dataset that tests all possible combinations of input.
Is that available somewhere? A Google search did not really turn it up.
I saw a similar question on Stack Overflow, but that one has not been fully answered, and it is actually just asking for one phrase (in ASCII) that would exercise all 64 characters. It does not cover padding with =, for example. So one test string will certainly not fit the bill for a 100% test.
Perhaps something like Base64Test in Bouncy Castle would do what you want? The tricky part in base64 is handling the padding correctly. It's certainly important to cover that, as you mentioned. Accordingly, RFC 4648 specifies these test vectors:
BASE64("") = ""
BASE64("f") = "Zg=="
BASE64("fo") = "Zm8="
BASE64("foo") = "Zm9v"
BASE64("foob") = "Zm9vYg=="
BASE64("fooba") = "Zm9vYmE="
BASE64("foobar") = "Zm9vYmFy"
Some of your implementations may produce base64 output that differs only in whether they insert line breaks, where implementations that do break lines insert them, and which line termination is used. You would have to do additional testing to determine whether you can safely replace an implementation that uses one style with a different one. In particular, a decoder might make assumptions about line length or termination.
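One way to check both points at once is to run every implementation against the RFC 4648 vectors and then compare line-wrapping behaviour on a longer input. The sketch below does that with Python's standard base64 module standing in as the reference implementation (an illustration of the kind of harness you would wrap around your own encoders, not a complete test suite):

import base64

# RFC 4648 §10 test vectors, including all the padding cases.
vectors = {
    b'': b'',
    b'f': b'Zg==',
    b'fo': b'Zm8=',
    b'foo': b'Zm9v',
    b'foob': b'Zm9vYg==',
    b'fooba': b'Zm9vYmE=',
    b'foobar': b'Zm9vYmFy',
}
for raw, encoded in vectors.items():
    assert base64.b64encode(raw) == encoded
    assert base64.b64decode(encoded) == raw

# Two otherwise-correct encoders can still disagree about line breaks:
data = bytes(range(64))
print(base64.b64encode(data))     # one line, no line terminator
print(base64.encodebytes(data))   # wrapped at 76 characters, '\n' terminated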

Is using regular expressions for validating data correct or not?

I have been finding articles and posts which suggest not using regular expressions to validate user data. I am not sure of all the reasons, but I usually see this in the context of email address verification.
So I want to be clear: is using regular expressions for validating user input good or not? If it is good, then what is wrong with using them to validate email addresses?
Edit:
So can we say that for basic primary validation of data we can use regexes, and that for full validation we need to combine them with another parser?
And for the second part, for email validation in general usage we can use them, but according to the standard they are not strictly correct. Is that right?
Now I am confused about selecting the correct answer.
It’s good because you can use regular expressions to express and test complex patterns in an easy way.
It’s bad because regular expressions can be complicated and there is much you can do wrong.
Edit: Well, OK. Here's some real advice: first make sure that the expected valid values can be expressed using a regular expression at all, that is, that the language of valid values is a regular language. Otherwise you simply cannot use regular expressions (or at least not regular expressions alone)!
Now that we know what can be validated using regular expressions, we should discuss what is viable to be validated using regular expressions. If we take an e-mail address as an example (like many others did), we should know what a valid e-mail address may look like (see RFC 5322):
addr-spec       = local-part "@" domain
local-part      = dot-atom / quoted-string / obs-local-part
domain          = dot-atom / domain-literal / obs-domain
domain-literal  = [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
dtext           = %d33-90 /       ; Printable US-ASCII
                  %d94-126 /      ;  characters not including
                  obs-dtext       ;  "[", "]", or "\"
Here we see that the local-part may consist of a quoted-string that may contain any printable US-ASCII character (excluding \ and ", but including @). So it is not sufficient to test whether the e-mail address contains just one @ if we want to allow addresses according to RFC 5322.
On the other hand, if we want to allow any valid e-mail address according to RFC 5322, we would also allow addresses that probably do not exist or are just senseless in most cases (e.g. ""@localhost).
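To see the point concretely, a typical "one @ and a dotted domain" pattern (made up here for illustration; it is not taken from the RFC) rejects addresses that are perfectly legal under the grammar above:

import re

# A typical "simple" email pattern: restricted local part, one @, dotted domain.
simple = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')

# Legal under RFC 5322 (quoted-string local part with a space in it),
# but rejected by the simple pattern.
print(bool(simple.match('"much.more unusual"@example.com')))   # False

# Also legal under RFC 5322, yet senseless in almost every real context.
print(bool(simple.match('""@localhost')))                      # False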
Your question seems to have two parts: (1) is using regular expressions for data validation bad, and (2) is using them for validating email addresses bad?
Re (1), this really depends upon the situation. In many situations a regular expression will be more than adequate to validate user input; for example, validating that a username has only alphanumeric characters. Where a set of regular expressions will probably be inadequate is when the input might be passed to something like a database query or an eval() statement. In these instances there may be language constructs like recursion that cannot be handled with regular expressions, and, more generally, you will want something that knows a lot about the target language to do the validation (and sanitization).
In most cases you'll want to escape the input so that it will be an innocuous string in the target language.
If you are validating the correctness of code, you will want a full-blown parser for this. A parser may make use of regular expressions, but typically parsers use other things to do the heavy lifting.
Regular expressions can be bad for three reasons:
They can get really complicated, and eventually unmaintainable. It's very easy to make mistakes.
There are certain types of text that cannot be parsed with regular expressions at all (e.g. HTML). Basically, anything with nested patterns cannot be parsed with regular expressions. You wouldn't be able to parse a programming language with regex, for example.
Depending on what kind of text you are working with, it may be easier and clearer if you just write your own code to parse it.
But if none of these is an issue for whatever you are working with, then there is nothing wrong with using regular expressions. I would say validating email addresses is a good use of regex.
Regular expressions are a tool like any other, albeit a very powerful one.
They are so powerful that people using them tend to suffer from the problem of everything looking like a nail (when you have a hammer). This leads to them being used in situations where another method would be more verbose but more efficient and more maintainable.
In the specific case of email addresses, the main problem here is that there are a very large number of regular expressions out there which claim to validate email address syntax, but are loaded up with problems that cause false negatives.
The main problems with them include:
Disallowing plus characters in the first half of the address (despite them being relatively common)
Limiting the TLD to three characters (thus blocking out the .museum TLD)
Limiting the TLD to two-character country-code TLDs or a list of specific TLDs (thus forcing the expression to be updated whenever a new TLD comes into play, which, guess what, never happens)
Email addresses are so complex that a regular expression shouldn't really try to do anything more than the following (a minimal sketch follows the list):
Something that doesn't include an @
An @
Something that doesn't include an @
A .
Something that doesn't include an @
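A minimal sketch of exactly that check (Python, for illustration only):

import re

# "something without @" + "@" + "something without @" + "." + "something without @"
minimal_email = re.compile(r'^[^@]+@[^@]+\.[^@]+$')

print(bool(minimal_email.match('user+tag@sub.example.museum')))   # True
print(bool(minimal_email.match('not-an-email')))                  # False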
For e-mail addresses it is good to use regular expressions. They will work in most cases.
In general: you should validate with regular expressions whatever can be expressed as a regular language.
If the pattern of the data you are validating can be expressed completely and correctly using regular expressions, you can use them safely with no worries. However, not all textual patterns can be expressed using regular expressions (some, for instance, require a context-free grammar). In such cases you might need to write a parser or a custom method for validating the data.
The concerns are probably about the fact that the regular expressions in use often do not cover all the possible (valid) inputs and/or restrict the user too much in what they can input.
I see no other way to validate whether some user input matches a certain schema (I mean, that is what regular expressions are for), so they are essential (IMO) for user input validation. But you definitely have to put some time into designing an expression, to make sure it really works, even in extreme cases.
Take credit card numbers. You have to consider the ways a user might enter them:
1234-5678
// or
1234 5678
// or
1234 - 5678
And now you have two possibilities:
You restrict the input to the first format, which will result in an easier expression but will restrict (and maybe annoy) the user the most.
You create an expression that accepts any of these possibilities, making the expression more complicated (hence harder to maintain) but more user friendly.
It is a trade-off.
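As a sketch of the second option (Python, illustrative only: real card numbers have more digits and checksum rules of their own; this only handles the separator variants listed above):

import re

# Two 4-digit groups separated by a hyphen, whitespace, or a hyphen
# surrounded by whitespace.
fragment = re.compile(r'^\d{4}(?:\s*-\s*|\s+)\d{4}$')

for s in ('1234-5678', '1234 5678', '1234 - 5678', '1234--5678'):
    print(s, bool(fragment.match(s)))   # the first three match, the last does not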
Regexes aren't bad for validating most data, if it's a regular language.
But, as has been noted, sometimes they can become difficult to maintain, and the programmers introduce errors.
The simplest way to mitigate the situation is with tests/TDD. These tests should call a method that uses the regular expression to validate email addresses (I currently use the regex /^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$/i, which is working well enough). This way, when you get a false positive or false negative, you can add another test for that case, adjust your regular expression, and ensure you didn't break some other condition.
If TDD seems a bit much, a tool like Expresso lets you save regexes with test data, which can help keep track of values that should pass or fail and aid in creating and understanding your regex.
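A sketch of what such tests might look like (Python's unittest here, purely as an illustration, using the pattern quoted above):

import re
import unittest

EMAIL = re.compile(r'^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$', re.IGNORECASE)

class EmailRegexTests(unittest.TestCase):
    def test_accepts_common_addresses(self):
        for addr in ('joe@example.com', 'joe.bloggs+tag@mail.example.org'):
            self.assertTrue(EMAIL.match(addr), addr)

    def test_rejects_obvious_garbage(self):
        for addr in ('joe@', '@example.com', 'no-at-sign'):
            self.assertFalse(EMAIL.match(addr), addr)

if __name__ == '__main__':
    unittest.main()

If, say, an address ending in .museum later shows up as a false negative, you add it to the accepted list, watch the test fail because of the {2,4} limit, relax the pattern, and rerun the suite to confirm nothing else broke.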
WARNING:
Take some care in constructing regular expressions. There is potential for introducing ReDoS vulnerabilities.
See: http://msdn.microsoft.com/en-us/magazine/ff646973.aspx
In short, a poorly constructed regex, given the right input, can take hours to execute, effectively killing your server's performance.
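As a small, deliberately bounded illustration of the problem, the classic nested-quantifier pattern below takes roughly four times longer for every two extra characters on a non-matching input (measured with Python's backtracking engine; many other backtracking engines behave similarly, while non-backtracking engines such as RE2 do not):

import re
import time

# Nested quantifiers over the same characters: on a near-miss input the
# engine tries exponentially many ways to split the run of 'a's between
# the inner and outer repetition before it can report failure.
evil = re.compile(r'^(a+)+$')

for n in (16, 18, 20, 22):
    target = 'a' * n + '!'           # the trailing '!' guarantees failure
    start = time.perf_counter()
    evil.match(target)
    print(n, round(time.perf_counter() - start, 3), 'seconds')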

Regular expressions Lexical Analysis

Why can repeated strings such as
{ wcw | w is a string of a's and b's }
not be denoted by regular expressions?
Please give me a detailed answer, as I am new to lexical analysis.
Thanks...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain arbitrarily nested structures, since such languages can be recognized by a simple finite state machine, which has only a fixed, finite amount of memory. Simplified, you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is that no information whatsoever from earlier states can be carried over to later states (a few characters further into the input). So if you have your symbol w, you can't specify that the input must contain exactly the same string w later in the sequence. Similarly, you can't ensure that every opening parenthesis is matched by a closing one (so the syntax of regular expressions itself is not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically consisting only of sequencing, alternation (|) and repetition (*); everything else can be described in terms of those operations.
However, practical regex engines usually allow grouping certain sub-patterns into captures which can then be referenced or extracted later. Some engines even allow the use of such a backreference in the search expression itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly, such use of backreferences can even yield languages that are not context-free.
Additional pointers:
This StackOverflow question
Wikipedia
Part of it can be: a string of a's and b's, then c, then a string of a's and b's is a regular pattern. You just can't ensure that it's the same string of "a"s and "b"s on both sides, because there's no way to retain the information acquired in traversing the first half for use in traversing the second.
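To tie the two answers together: a plain regular expression can describe "some a's and b's, then c, then some a's and b's", but only a backreference (which, as noted above, goes beyond regular languages) can demand that the two halves be identical. A small Python illustration:

import re

# Group 1 captures the first w; \1 then demands the identical string after 'c'.
# The backreference is exactly the non-regular part of this pattern.
wcw = re.compile(r'([ab]*)c\1')

print(bool(wcw.fullmatch('abcab')))    # True:  w = 'ab'
print(bool(wcw.fullmatch('c')))        # True:  w = '' (the empty string)
print(bool(wcw.fullmatch('abcba')))    # False: the two halves differ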