POSIX Character Class with Negation - regex

I'm curious whether there is any way to write a regular expression that matches input which is printable (as defined by the POSIX character class [:print:]) but does not contain a specific letter, such as the letter a.
Such an expression would enable me to look for all characters which are printable, and then perform additional exclusions. My initial thought was to use nested character classes to achieve this, but I do not believe that will work.
This is for a small parser which I am working on in lex -- thanks for any feedback.

flex (if you can use that) offers the {-} operator which provides exactly what you're looking for:
[[:print:]]{-}[a]
It also has a {+} operator for union. Both only work with character classes, though.

In PCRE and other engines with lookaround, you could use that (e.g. [[:print:]](?<!a)), but unless it has changed recently, lex doesn't support lookaround.
While there are probably ways to make this distinction in the lexical analyzer, it may be cleaner to do it in the parsing logic instead.
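For engines that do have lookaround, the idea can be sketched in Python. This is only an illustration, not lex code: Python's re module has no POSIX bracket classes, so the sketch assumes "printable" means ASCII 0x20 through 0x7E (what [:print:] covers in the C locale), and the names PRINTABLE_EXCEPT_A and all_printable_except_a are made up for the example.

```python
import re

# Assumption: "printable" is ASCII 0x20-0x7E, standing in for [[:print:]].
# The negative lookahead (?!a) subtracts the letter 'a' from that set.
PRINTABLE_EXCEPT_A = r"(?!a)[\x20-\x7e]"

def all_printable_except_a(text: str) -> bool:
    """Return True if every character is printable ASCII and none is 'a'."""
    return re.fullmatch(f"(?:{PRINTABLE_EXCEPT_A})*", text) is not None

print(all_printable_except_a("hello world"))  # no 'a', all printable
print(all_printable_except_a("banana"))       # contains 'a'
```

The lookahead form checks the exclusion before consuming the character, which does the same job here as the lookbehind form shown above.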

Related

Regex character interval with exception

Say I have an interval of characters ['A'-'Z']. I want to match every one of these characters except the letter 'F', and I need to do it with the ^ operator. In other words, I don't want to split it into two different intervals.
How can I do it the best way? I want to write something like ['A'-'Z']^'F' (All characters between A-Z except the letter F). This site can be used as reference: http://regexr.com/
EDIT: The relation to ocaml is that I want to define a regular expression of a string literal in ocamllex that starts/ends with a doublequote ( " ) and takes allowed characters in a certain range. Therefore I want to exclude the doublequotes because it obviously ends the string. (I am not considering escaped characters for the moment)
Since it is very rare to find two regular expressions libraries / processors with exactly the same regular expression syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, Ecmascript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
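As a quick sanity check that the two lookahead constructions really select the same characters as the hand-split range, here is a small Python sketch. It only tests single characters drawn from string.printable, and the helper name accepted is invented for the example.

```python
import re
import string

patterns = {
    "negative lookahead": r"(?!F)[A-Z]",
    "positive lookahead": r"(?=[A-Z])[^F]",
    "explicit ranges": r"[A-EG-Z]",
}

def accepted(pattern: str) -> str:
    """Collect every printable ASCII character that the pattern matches alone."""
    return "".join(c for c in string.printable if re.fullmatch(pattern, c))

results = {name: accepted(p) for name, p in patterns.items()}
for name, chars in results.items():
    print(f"{name}: {chars}")
```

All three lines of output should list the uppercase letters without F.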
The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']
(?!F)[A-Z] or ((?!F)[A-Z])*
This will match every uppercase character excluding 'F'
Use character class subtraction (written in Java as an intersection with a negated class):
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
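If your engine has no subtraction or intersection operator, one option is to generate the long form instead of writing it by hand. The following Python sketch does the set arithmetic in ordinary code and compresses the result into ranges. The function name to_char_class is hypothetical, and it assumes a non-empty set of characters.

```python
import re
import string

def to_char_class(chars) -> str:
    """Compress a non-empty set of characters into a bracket expression."""
    codes = sorted(ord(c) for c in set(chars))
    parts = []
    i = 0
    while i < len(codes):
        # Extend j to the end of the current run of consecutive code points.
        j = i
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        first = re.escape(chr(codes[i]))
        last = re.escape(chr(codes[j]))
        if j == i:
            parts.append(first)               # single character
        elif j == i + 1:
            parts.append(first + last)        # two neighbours, no range needed
        else:
            parts.append(f"{first}-{last}")   # run of three or more
        i = j + 1
    return "[" + "".join(parts) + "]"

consonants = set(string.ascii_uppercase) - set("AEIOU")
print(to_char_class(consonants))
```

The exclusions stay readable in the source (a plain set difference), and only the generated pattern looks like the train wreck.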
The regex below accomplishes what you want using ^ and without splitting into different intervals. It also resembles your original thought (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed, simply remove the i flag.

Transform regex with character classes and repetitions to its most basic ASCII form

Is there a way, a regular expression maybe or even a library, which can transform a regular expression with character classes and repetition to its most basic ASCII form.
For example I'd like to have the following conversions:
\d -> [0-9]
\w -> [A-Za-z0-9_]
\s -> [ \t\r\n\v\f]
\d{2} -> [0-9][0-9]
\d{3,} -> [0-9][0-9][0-9]+
\d{,3} -> I don't even know how to show this...
There is a commercial product called RegexBuddy that lets you enter a regex in their syntax and then generate the version for any of a number of popular systems. There may be something similar out there for free, or you could write your own.
At its most basic, a regular expression syntax only needs two things: alternation (OR) and closure (STAR). Well, and grouping. OK, three things. Other common operators are just shortcuts, really:
x+ = xx*
x? = (|x)
[xyz] = (x|y|z)
etc.
Things like \d just map to character classes and then to alternations. Negated character classes and . map to very big alternations. :)
There are some features that don't translate, however, such as lookaround. Mapping those to something that works without the feature is not readily automatable; it will depend upon the particular circumstances motivating their use.
First, you'd have to define which transformations you want to do. As written in the comments, not all advanced features can be written in terms of simpler operators. For example, the lookaround operators have no substitute. So you're limited by the target regexp parser anyway.
Then, with this list of transformations, you should simply apply them. They can probably be written as regexps themselves, but it might be easier to write a script in Python or so to actually parse (but not evaluate) the regexp. Then it can write it back with the requested transformations applied. And bark at you if you've used too complex features.
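A very small version of such a script might look like the sketch below. It is not a real regex parser: it only rewrites \d, \w and \s with an optional {m}, {m,}, {,n} or {m,n} count, and it assumes those shorthands never appear inside a bracket expression or after an escaped backslash. The names SHORTHAND, ATOM and expand are made up for the example.

```python
import re

# Shorthand classes and their ASCII expansions, as listed in the question.
SHORTHAND = {
    r"\d": "[0-9]",
    r"\w": "[A-Za-z0-9_]",
    r"\s": r"[ \t\r\n\v\f]",
}

# One shorthand atom, optionally followed by a {...} repetition count.
ATOM = re.compile(r"(\\[dws])(?:\{(\d*)(,?)(\d*)\})?")

def expand(pattern: str) -> str:
    """Rewrite shorthand classes and counted repetition in basic form."""
    def repl(match: re.Match) -> str:
        atom = SHORTHAND[match.group(1)]
        low, comma, high = match.group(2), match.group(3), match.group(4)
        if low is None:                  # no {...} suffix at all
            return atom
        if not comma:
            if not low:                  # "{}" is not a repetition count
                return atom + "{}"
            return atom * int(low)       # {m}: exactly m copies
        minimum = int(low) if low else 0
        if not high:                     # {m,}: unbounded upper limit
            if minimum == 0:
                return atom + "*"
            return atom * minimum + "+"
        maximum = int(high)
        if maximum < minimum:
            raise ValueError(f"bad repetition in {match.group(0)!r}")
        # {m,n}: m required copies followed by n - m optional ones.
        return atom * minimum + (atom + "?") * (maximum - minimum)
    return ATOM.sub(repl, pattern)

for example in (r"\d", r"\d{2}", r"\d{3,}", r"\d{,3}"):
    print(example, "->", expand(example))
```

Under these assumptions \d{,3} comes out as three optional copies, [0-9]?[0-9]?[0-9]?, which is one way to show the case the question was unsure about.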
This wouldn't be too hard, but I'm not so sure if it would be very useful either. If you need powerful regexps, use a better regexp engine. It should be easy to write a simple Python or Perl script instead of a simple Awk script, for example.

boost::regex - \bb?

I have some badly commented legacy code here that makes use of boost::regex::perl. I was wondering about one particular construct before, but since the code worked (more or less), I was loath to touch it.
Now I have to touch it, for technical reasons (more precisely, current versions of Boost no longer accepting the construct), so I have to figure out what it does - or rather, was intended to do.
The relevant part of the regex:
(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))
The piece that gives me headaches is \bb. I know of \b, but I could not find mention of \bb, and looking for a literal 'b' would not make sense here. Is \bb some special underdocumented feature, or do I have to consider this a typo?
Since Boost is a regex engine for C++, and one of its compatibility modes is Perl compatibility, if that is a "Perl-compatible" expression then the second 'b' can only be a literal.
It's a valid expression, pretty much a special case for words beginning with 'b'.
The deciding factor seems to be that this is a C++ library whose purpose is to give non-Perl environments Perl-compatible regexes. Thus my original thought, that Perl itself might interpret the expression specially (say, with overload::constant), does not apply. It is still worth mentioning for clarification, however inadvisable it would be to tweak an expression meaning "word beginning with 'b'".
The only caveat is that if Boost outperformed Perl at its own expressions and somebody were using the Boost engine in a Perl environment, all bets would be off as to whether this was meant as a special expression. As just one stab: given a grammar where '!!!' meant something special at the beginning of words, you could piggyback on the established meaning like this (NOT RECOMMENDED!):
s/\\bb\b/(?:!!!(\\p{Alpha})|\\bb)/
This would be something dumb to do, but as we are dealing with code that seems unfit for its task, there are thousands of ways to fail at a task.
(\bb\s|\bb|^[a-z]\s|^[a-z]) matches a b if it's not preceded by another word character, or any lowercase letter if it's at the beginning of the string. In either case, the letter may be followed by a whitespace character. (It could match uppercase letters too if case-insensitive mode is set, and the ^ could also match the beginning of a line if multiline mode is set.)
But inside a lookbehind, that shouldn't even have compiled. In some flavors, a lookbehind can contain multiple alternatives with different, fixed lengths, but the alternation has to be at the top level in the lookbehind. That is, (?<=abc|xy|12345) will work, but (?<=(abc|xy|12345)) won't. So your regex wouldn't work even in those flavors, but Boost's docs just say the lookbehind expression has to be fixed-length.
If you really need to account for all four of the possibilities matched by that regex, I suggest you split the lookbehind into two:
(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)
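To see the split lookbehinds behave as described, here is a sketch using Python's re module, which, like Boost, wants a fixed width for each lookbehind. It is not Boost code, and the trailing X is a hypothetical stand-in for whatever follows the lookbehind in the real expression.

```python
import re

# The two fixed-width lookbehinds suggested above, followed by a
# hypothetical target token "X" so there is something to match.
SPLIT = re.compile(r"(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)X")

def blocked(text: str) -> bool:
    """True if the lookbehinds stop every X in the text from matching."""
    return SPLIT.search(text) is None

# The original single lookbehind mixes widths 1 and 2, so Python rejects it.
try:
    re.compile(r"(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))X")
    original_compiles = True
except re.error:
    original_compiles = False

for sample in ("bX", "b X", "cX", "a X", "ab X", "zz X"):
    print(repr(sample), "blocked" if blocked(sample) else "matches")
print("original compiles:", original_compiles)
```

The first four samples cover the four alternatives of the original group. The last two show that a b in the middle of a word, or a lowercase letter that is not at the start of the string, does not block the match.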

Is there any way to have dot (.) match newline in C++ TR1 Regular Expressions?

I couldn't find anything regarding this on http://msdn.microsoft.com/en-us/library/bb982727.aspx.
Maybe I could use '[^]+' to match everything but that seems like a hack?
Boost.Regex has a mod_s flag to make the dot match newlines, but it's not part of the TR1 regex standard. (and not available as a Microsoft extension either, as far as I can see)
As a workaround, you could use [\s\S] (which means match any whitespace or any non-whitespace).
As C++ regular expressions appear to be based on ECMAScript regular expressions, the answer to the recent question about the same thing in JavaScript may help you.
[^] should work, but if you want something a little more clear and less hackish, you could try (.|\n).
One trick people use is a character class containing anything that is not the null character. The null character is expressed in hex. It looks something like this:
[^\x00]+
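The workarounds mentioned so far can be compared side by side. The sketch below uses Python's re module rather than TR1, so treat it as an illustration of the constructs and not as proof of what a particular C++ implementation does. [^] is left out because Python rejects it as an unterminated character class.

```python
import re

text = "first line\nsecond line"

candidates = {
    "plain dot": r".+",
    "whitespace or non-whitespace": r"[\s\S]+",
    "dot or newline": r"(?:.|\n)+",
    "anything but NUL": r"[^\x00]+",
}

# For each candidate, record whether it can match across the newline.
spans_newline = {
    name: re.fullmatch(pattern, text) is not None
    for name, pattern in candidates.items()
}

for name, ok in spans_newline.items():
    print(f"{name}: {ok}")
```

Only the plain dot fails to cover the whole string. Note that the NUL trick silently stops at any embedded \x00, which [\s\S] does not.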
You can switch to a non-ECMAScript flavor of regular expression (there are a number of flags to control the regex flavor). Any POSIX regex should, if I recall correctly, match a newline with the dot (.).

Can regular expressions work with different languages?

English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically, it depends on whether your regex engine supports Unicode matching (as described here).
Such matches can complicate your regular expressions enormously, so I recommend reading this Unicode regex tutorial. Also note that Unicode implementations themselves can be quite a mess, so you might also benefit from reading Joel Spolsky's article about the inner workings of character sets.
"[\p{L}]"
This regular expression contains all characters that are letters, from all languages, upper and lower case.
So letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted, but signs like (, . ? > :) and similar ones are not.
The brackets [] mean that this expression is a set.
If you want an unlimited number of letters from this set to be accepted, use an asterisk * after the brackets, like this: "[\p{L}]*"
It is always important to take care of white space in your regex, since your evaluation might fail because of it. To solve this you can use: "[\p{L} ]*" (notice the white space inside the brackets)
If you want to include numbers as well, "[\p{L}\p{N} ]*" can help. \p{N} matches any kind of numeric character in any script.
As far as I know, there isn't any specific pattern like [a-zA-Z] that also matches "è", but you can always list such characters separately, e.g. [a-zA-Zè正]
Obviously that can make your regexp immense, but you can always control this by adding your strings into variables, and only passing the variables into the expressions.
Generally speaking, regex is more for grokking machine-readable text than for human-readable text. It is in many ways a more general answer to the whole XML with regex thing; regex is by its very nature incapable of properly parsing human language, because the language is more complex than what you are using to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should, for example, match characters in the Latin script. You can get the full explanation and reference here.
It is not about the regular expression but about the framework that executes it. Java and .NET, I think, are very good at handling Unicode, so "è and e both considered word characters by regex" is true.
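How much the framework matters is easy to see in Python 3, used here purely as an example engine. Its built-in re module has no \p{L}, so this sketch approximates "letters" as [^\W\d_] (word characters minus digits and underscore). That is an assumption and not an exact equivalent of \p{L}. The same pattern is then compiled with re.ASCII to show the ASCII-only behaviour.

```python
import re

samples = ["expressions", "régulières", "正規表現", "3.14", "a b"]

# [^\W\d_] means: a word character that is neither a digit nor underscore.
LETTERS = re.compile(r"[^\W\d_]+")                 # Unicode-aware by default
ASCII_LETTERS = re.compile(r"[^\W\d_]+", re.ASCII)  # \w limited to [A-Za-z0-9_]

unicode_ok = [s for s in samples if LETTERS.fullmatch(s)]
ascii_ok = [s for s in samples if ASCII_LETTERS.fullmatch(s)]

print("Unicode mode:", unicode_ok)
print("ASCII mode:  ", ascii_ok)
```

In Unicode mode the English, French and Japanese samples all count as runs of letters. In ASCII mode only the unaccented English word does.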
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., \p{Ll} is all lowercase letters, regardless of language).