Escaping a single character using a character class - regex

In a regular expression, the normal way to use a special character (\^$.|?*+()[]{}) as a literal is, of course, to escape it with a backslash:
\+\.
But I have occasionally seen code that uses a character class to achieve the same thing:
[+][.]
Now obviously that isn't the primary purpose of a character class, which is normally used to match one of several characters. While the second example uses more keystrokes, you could argue that it's also more readable.
So is there any good reason not to do this (performance or otherwise)? Or does it simply come down to personal stylistic preference?
I know this isn't an earth-shattering issue—it's just a little question that has been niggling away at the back of my mind for a while, and I've not been able to find any specific mention of it elsewhere.

I tend to view using a character class as a means of escaping a single character as a side-effect of character classes, which is not their primary purpose. The main reason for a character class is to represent a range of characters, not just a single character.
So, one possibly negative thing about the pattern [+][.] is that it might leave a future reader of your regex wondering if you did not intend to include more than one character in the character class. And perhaps, given certain conditions, that reader might even change the pattern to "fix" it, by adding characters to the class which he perceives as having been wrongfully omitted.
There might be a slight performance advantage to using \+ over [+], in that the latter might require the regex engine to compile a formal list (with just one character in it). But I would expect any performance difference to be minimal.
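To check that the two spellings really behave the same, here is a minimal sketch using Python's built-in re module (the engine is my choice for illustration; the sample strings are made up):

```python
import re

ESCAPED = re.compile(r"\+\.")      # backslash-escaped literals
BRACKETED = re.compile(r"[+][.]")  # single-character classes

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]

# Made-up sample inputs: both patterns should find exactly the same matches.
samples = ["+.", "a+.b+.", "+x.", "++.", ""]
agree = all(spans(ESCAPED, s) == spans(BRACKETED, s) for s in samples)
print(agree)
```

One practical limit of the character-class style, at least in Python's re: it does not work for every metacharacter. [^] and [\] both fail to compile, so a caret or a backslash still needs a backslash escape.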

Encode/decode certain text sequences in Qt

I have a QTextEdit where the user can insert arbitrary text. This text may contain some special sequences of characters which I wish to translate automatically, and from the translated version I would like to be able to get back to the original sequences.
Take for instance this:
QMessageBox::information(0, "Foo", MAGIC_TRANSLATE(myTextEdit->text()));
If the user wrote the sequence \n inside myTextEdit's text, I would like MAGIC_TRANSLATE to convert the string \n to an actual newline character.
In the same way, if I pass in a text containing a newline, MAGIC_UNTRANSLATE should convert the newline back to the string \n.
Now, of course I can implement these two functions by myself, but what I am asking is if there is something already made, easy to use, in Qt, which allows me to specify a dictionary and it does the rest for me.
Note that sequences with common prefix can create some conflicts, for example converting:
\foo -> FOO
\foobar -> FOOBAR
can give rise to issues when translating the text asd \foobar lol, because if \foo is searched and replaced before \foobar, then the resulting text will be asd FOObar lol instead of the (more natural) asd FOOBAR lol.
I hope I have made my needs clear. I believe that this may be a common task, so I hope there is a Qt solution which takes this kind of issue with conflicting prefixes into account.
I am sorry if this is a trivial topic (as I think it may be), but I am not at all familiar with encoding techniques and issues, and my knowledge of Qt encoding covers only very simple Unicode-related issues.
EDIT:
Btw, in my case a data-oriented approach, based on resources or external files or anything that does not require a recompilation, would be great.
It sounds like your question is, "I want to run a sequence of regular expression or simple string replacements to map between two encodings of some text".
First you need to work out your mapping, exactly. As you say, if your escape sequences like \foo and \foobar are fiddly, you might find that you don't have a bidirectional, lossless mapping. No library in the world can help you if your design or encoding is flawed.
When you end up with a precise design (which we can't help you on given the complete lack of information provided on the purpose of this function), you'll probably find that a sequence of string replacements is fine. If it really is more complicated, then some QRegExps should be enough.
It is always a bit ugly to self-answer questions, but... Maybe this solution is useful to someone.
As suggested by Nicholas in his answer, a good strategy is to use replacement. It is simple and effective in most cases, for example in the plain C/C++ escaping:
\n \r \t etc
This works because they are all different. It will always work with a replacement if the sequences are all different and, in particular, if no sequence is a prefix to another sequence.
For example, if your sequences are the ones above plus some Greek letters, you will run into trouble with the \nu sequence, which should be translated to ν.
If the replacing function tests for \n before \nu, the result is wrong.
Assuming that the two sequences must be translated into two completely different entities, there are two solutions: add a closing character to each sequence, for example \nu;, or replace from the longest string to the shortest. The latter ensures that a sequence which is a prefix of another one is never replaced before the longer one.
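The longest-first idea can be sketched in a few lines. This is Python rather than Qt, purely for illustration; the ESCAPES table and the make_translator helper are hypothetical names, and in a data-driven setup the table could just as well be loaded from an external file:

```python
import re

# Hypothetical example table; \n and \nu share a prefix, as do \foo and \foobar.
ESCAPES = {"\\n": "\n", "\\nu": "ν", "\\foo": "FOO", "\\foobar": "FOOBAR"}

def make_translator(mapping):
    """Build a function that replaces every key of mapping in a single pass."""
    # Longest keys first, so a key that is a prefix of another one
    # can never win against the longer key.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

translate = make_translator(ESCAPES)
untranslate = make_translator({v: k for k, v in ESCAPES.items()})

print(translate("asd \\foobar lol"))
print(untranslate("asd FOOBAR lol"))
```

Because the replacement happens in one pass, already-replaced text is never scanned again. The round trip is only lossless when the mapping itself is unambiguous in both directions, which this sketch does not check.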
For various reasons, I tried another way: using a trie, which is a tree of all the prefixes of a dictionary of words. Long story short: it works fairly well and probably works faster than (most) regexes and replacements.
Regexes are state machines, and it is not rare for them to re-process parts of the input. With a trie, you avoid matching the same characters twice, so it goes pretty fast.
Code for tries is pretty easy to find on the internet, and the modifications to do efficient matching are trivial, so I will not write the code here.
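For readers who want something concrete anyway, here is a minimal sketch of trie-based longest-match replacement. It is written in Python for brevity and is not the code I actually used; build_trie, translate and the sample table are illustrative names only:

```python
def build_trie(mapping):
    """Build a nested-dict trie; the key None marks the end of a sequence."""
    root = {}
    for key, value in mapping.items():
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = value  # store the replacement at the terminal node
    return root

def translate(text, trie):
    """Replace every longest matching sequence found in the trie."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = trie, i
        best_end, best_value = None, None
        # Walk down the trie as far as the input allows,
        # remembering the longest complete sequence seen so far.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best_end, best_value = j, node[None]
        if best_end is None:
            out.append(text[i])  # no sequence starts here, copy the character
            i += 1
        else:
            out.append(best_value)
            i = best_end
    return "".join(out)

# Hypothetical example table.
trie = build_trie({"\\n": "\n", "\\nu": "ν", "\\foo": "FOO", "\\foobar": "FOOBAR"})
print(translate("asd \\foobar lol", trie))
```

The same ambiguity remains as with plain replacement: \nu always becomes ν here, even if the user meant a newline followed by the letter u.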

removing redundancy from regex with multiple possible delimiters

I have a regex in which the same match criteria can apply to multiple delimiters. [], (), and <> are all valid. For example purposes it looks like this:
\[.\]|\(.\)|<.>
Is there some way to remove the redundancy from the above regex? The match criteria inside the delimiters is always the same, but the delimiters themselves may be different.
I'm guessing you're asking because
[[(<].[])>]
isn't exact enough, for obvious reasons.
It's always dangerous to answer, "No, there is no way," because it's hard to be sure one has checked every possible way. One must often come up with a solid proof to answer in such cases.
I'm not sure this is a strong-enough proof, or even a "proof" at all, but consider this (pseudo-)information-theory perspective:
The PCRE engine itself has no knowledge of any relation between the pairs of characters, [], (), and <>. Thus, the expression itself must contain that information, i.e. require at least the six characters []()<> to be present.
Not only that, but for the same reason, the expression itself must define at least two pairings (leaving the third to be implied). I'm not sure how to prove that two alternation operators (|) is the best you can do, but I mean, even if there were a more compact way, you're going to save one character at most, since at least one bit is required to say, "Pairings exist!"
The escaping of meta-characters can only be compacted by the fact that []() can appear within character classes without being escaped, but firstly, that isn't really a "removal of redundancy" as much as it is "a lucky circumstance in syntax", and secondly, you still have to add two characters for the definition of said character class: [].
Therefore, it is my belief that even from a theoretical perspective, if my presumptions about what a regex engine cannot know are true, then one can save at most three characters from the regex you've already provided: \[.\]|\(.\)|<.>.
I eagerly look forward to being corrected by the regex gurus!
If you really are using the PCRE library (via PHP, for example) you can use a DEFINE group to create a subroutine, like so:
'~(?(DEFINE)(?<content>\w+))(?:<(?&content)>|\[(?&content)\]|\((?&content)\))~'
...or more readably:
(?(DEFINE)(?<content>\w+))
(?:
<(?&content)>
|
\[(?&content)\]
|
\((?&content)\)
)
Here's a demo in PHP. It should work in Perl, too.
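If your engine has no subroutine support (Python's built-in re module, for one, does not understand (?(DEFINE)...) or (?&name)), the redundancy can still be removed in the source code by assembling the pattern from its parts. A rough sketch in Python, where CONTENT and DELIMITERS are names made up for this example:

```python
import re

CONTENT = r"\w+"  # the shared match criteria, written once
DELIMITERS = [("[", "]"), ("(", ")"), ("<", ">")]

# One alternative per delimiter pair; re.escape handles the metacharacters.
pattern = re.compile(
    "|".join(re.escape(left) + CONTENT + re.escape(right)
             for left, right in DELIMITERS)
)

for candidate in ["[abc]", "(abc)", "<abc>", "[abc)"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

The compiled regex is still the three-way alternation from the question; only the source that produces it is free of repetition.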

differentiating and testing regex variants

Several implementations of regular expressions differ from each other in subtle ways which is the source of much confusion when I try to use them.
Most of these differences include the semantics related to whether a character is escaped or not. This is most often an issue with parentheses, but can apply to curly brackets and others. This is probably a consequence of the syntax of the language or environment in which the implementation is found. For instance, if the $ symbol indicates a variable name in some language, one can expect regular expressions represented in that language would require escaping the "end of line" anchor to \$ or some such. But what gets confusing at this point is how you would represent an actual dollar sign. I believe Perl gets around this by wrapping a regex inside forward slashes /.
Similarly, there are escapes for specific characters themselves, for instance non-printing characters such as \n and \t. Then there are the similar-looking generic character groups such as \d for digits, \s for whitespace, and \w, which I just learned covers underscores as well as letters and digits. On several occasions I found myself trying to use \a for an "alphabetical" group, but this only ended up matching the bell character 0x07.
It's pretty clear that there is no simple and one-shot solution to knowing all of the differences in features and syntax offered by the myriad of implementations of regular expressions out there, short of somebody doing all the hard work and putting results in a well organized table. Here is one example of exactly this, but of course it doesn't cover several of the programs that I use extensively myself, which include vim, sed, Notepad++, Eclipse, and believe it or not MS Word (at least version 2010, I suspect 2007 also has this, they call it "wildcards") has a simple regex implementation too.
I guess what I want is to be as lazy as possible (in a certain sense) by trying to come up with a way to determine for any given regex implementation what its "escape settings" are beyond any doubt by applying one (or a few) queries.
I'm thinking I can make a file which contains test cases, along with a huge regex query, and somehow engineer it so that running it once will show me exactly what syntax I need to use subsequently without doubting myself any further. (as opposed to having to edit files and use multiple queries to figure out the same thing which gets terribly old after a while).
If nobody else has attempted to construct such a monstrosity, I may undertake this task myself. If it's even possible. Is this possible?
I tried to come up with an example (it was just to figure out if EOL anchor is $ or \$) but in every case I had to use a multitude of different search/replace queries in order to determine how the program will respond to the input.
Edit: I came up with something using capturing and backtracking. I gotta work on it a little more.
Update: Well, Notepad++ does not implement the OR operator commonly denoted by the pipe |. Word's "wildcards" is a poor substitute also, it doesn't have | or *. I'm fairly certain that missing any of the regular expression operators (union, concat, star) means it cannot generate a regular grammar, so those two are ruled out.
I can create an input file like this:
$
*
]
EOL
and query
(\$)|(\*)|(\[)|($)
replacing with
escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:
yields a result of (assuming unescaped parens is group and unescaped pipe is or)
escDollar:$:escStar::escSQBrL::Dollar::
escDollar::escStar:*:escSQBrL::Dollar::
]escDollar::escStar::escSQBrL::Dollar::
EOLescDollar::escStar::escSQBrL::Dollar::
I ran this in vim. This output would demonstrate the single characters that are matched by each item specified next to it, i.e. the escaped dollar sign item is seen to match the actual dollar sign character rather than the non escaped dollar sign item at the end.
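The same probe is easy to script. As a sketch, here it is run through Python's re module, where unescaped parens group and an unescaped pipe alternates, which is the assumption stated above; count=1 limits the substitution to the first match on each line, and the variable names are my own:

```python
import re

PROBE = r"(\$)|(\*)|(\[)|($)"
TEMPLATE = r"escDollar:\1:escStar:\2:escSQBrL:\3:Dollar:\4:"

input_lines = ["$", "*", "]", "EOL"]

# Groups that did not take part in the match are substituted as empty strings
# (this has been the behaviour of re.sub since Python 3.5).
results = [re.sub(PROBE, TEMPLATE, line, count=1) for line in input_lines]

for line in results:
    print(line)
```

Since the result is plain text with fixed labels, a script can read off which group captured which character without any manual inspection.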
It's difficult to see what's going on with the $ anchor since it matches zero characters, but it shouldn't be hard to find a solution for it. Besides it's not a commonly mistaken one. The ones I'm particularly worried about are pipe and parens and the different brackets. When you've got 4 different types in there there are 2^4 combinations of escaped and non-escaped versions of them you can use. Trial-and-error with that is horrific.
This output isn't too hard to parse at a glance, and is also seriously easy to process as part of a script. The one glaring problem that remains is figuring out whether parens and pipe need to be escaped. Because the functionality of the whole thing depends on them.
It would seem like that will require multiple queries. It may be possible with a cleverly engineered jumble of backslashes, parens, and pipes to figure out the combination (only 4 possibilities after all) with an initial query, then choose the subsequent matrix generator query based on it.
Something like this shows it can work:
(e)
(f)
querying
\((f\))|\|\((e\))
replace with
\1:\2
would produce:
:(e if escaped parens is group and escaped pipe is or
:e) if parens is group and escaped pipe is or
(f: if escaped parens is group and pipe is or
f): if parens is group and pipe is or
I still don't really like this though because it requires a second query on a second set of input. Too much setting up. I may just make 4 copies of the "matrix" thing.
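For what it's worth, here is that paren/pipe probe run against one concrete engine, Python's re, as a small sketch (variable names are mine). Python treats unescaped parens as a group and an unescaped pipe as alternation, so it should land in the last of the four cases listed:

```python
import re

PAREN_PIPE_PROBE = r"\((f\))|\|\((e\))"

probe_input = ["(e)", "(f)"]
probe_output = [re.sub(PAREN_PIPE_PROBE, r"\1:\2", line, count=1)
                for line in probe_input]

for line in probe_output:
    print(line)
```

In this flavor the (e) line is left untouched, because the second alternative requires a literal pipe in front of it, and the (f) line turns into f): as predicted.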
The table on this page summarizes quite nicely which features are available in which regex implementations:
http://www.regular-expressions.info/refflavors.html

regex match upto some character

Conditions updated
There is often a situation where you want to extract a substring up to (immediately before) certain characters. For example, suppose you have a text that:
Does not start with a semicolon or a period,
Contains several sentences,
Does not contain any "\n", and
Ends with a period,
and you want to extract the sequence from the start up to the closest semicolon or period. Two strategies come to mind:
/[^;.]*/
/.*?[;.]/
I use either of these quite randomly, with a slight preference for the second strategy, and I also see both ways in other people's code. Which is the better way? Is there a clear reason to prefer one over the other, or are there better ways? I personally feel, efficiency aside, that negating something (as with [^]) is conceptually more complex than not doing it. But efficiency may also be a good reason to choose one over the other.
I came up with my answer. The two regexes in my question were actually not expressing the same thing. And the better approach depends on what you want.
If you want a match up to and including a certain character, then using
/.*?[;.]/
is simpler.
If you want a match up to right before (excluding) a certain character, then you should use:
/[^;.]*/
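A quick way to see the difference, sketched in Python with a made-up sample text (re.match anchors at the start of the string, which fits the conditions in the question):

```python
import re

text = "First clause; second clause. Third sentence."

# Negated class: stops right before the first semicolon or period.
excluding = re.match(r"[^;.]*", text).group()

# Lazy dot: stops right after the first semicolon or period.
including = re.match(r".*?[;.]", text).group()

print(repr(excluding))
print(repr(including))
```

The only difference between the two results is the trailing delimiter, which is exactly the "excluding" versus "including" distinction described above.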
Well, the first way is probably more efficient, not that it's likely to matter. By the way, \z in a character class does not mean "end of input"--in fact, it's a syntax error in every flavor I know of. /[^;.]*/ is all you need anyway.
I personally prefer the first one because it does exactly as you would expect. Get all characters except ...
But it's mostly a matter of preference. There are nearly always multiple ways to write a regular expression and it's mostly style that matters.
For example... do you prefer [0-9], [:digit:] or \d? They all do exactly* the same.
* In case of unicode the [:digit:] and \d classes match some other characters too.
You left out one other strategy: string split.
"my sentence; blahblah".split(/[;.]/,2)[0]
I think that it is mostly a matter of opinion as to which regular expression you use. On the note of efficiency, though, I think that adding \A to the beginning of a regular expression in this case would make the process faster because well designed regular expression engines should only try to match once in that case. For example:
/\A[^.;]*/m
Note the m option; it indicates that newline characters can also be matched. This is just a technicality I would add for generic examples, but may not apply to you.
Although adding more to the solution might be viewed as increasing complexity, it can also serve to clarify meaning.
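For comparison, the split and \A-anchored variants look like this in Python. This is only a sketch of the same two ideas in another language, reusing the sample string from the split answer above:

```python
import re

text = "my sentence; blahblah"

# Split on the first semicolon or period and keep what comes before it.
first_by_split = re.split(r"[;.]", text, maxsplit=1)[0]

# Anchor at the very start of the input and take everything up to a delimiter.
first_by_anchor = re.search(r"\A[^.;]*", text).group()

print(repr(first_by_split))
print(repr(first_by_anchor))
```

Both give the same substring here; which one reads better is, as said, mostly a matter of taste.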

boost::regex - \bb?

I have some badly commented legacy code here that makes use of boost::regex::perl. I was wondering about one particular construct before, but since the code worked (more or less), I was loath to touch it.
Now I have to touch it, for technical reasons (more precisely, current versions of Boost no longer accepting the construct), so I have to figure out what it does - or rather, was intended to do.
The relevant part of the regex:
(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))
The piece that gives me headaches is \bb. I know of \b, but I could not find mention of \bb, and looking for a literal 'b' would not make sense here. Is \bb some special underdocumented feature, or do I have to consider this a typo?
As Boost seems to be a regex engine for C++, and one of the compatibility modes is Perl compatibility, if that is a "Perl-compatible" expression, then the second 'b' can only be a literal.
It's a valid expression, pretty much a special case for words beginning with 'b'.
It seems to be the deciding factor that this is a C++ library, and that its purpose is to give Perl-compatible regexes to environments that aren't Perl. Thus my original thought that Perl might interpret the expression (say with overload::constant) is invalid. Yet it is still worth mentioning just for clarification purposes, regardless of how inadvisable it would be to tweak an expression meaning "word beginning with 'b'".
The only caveat to that idea is that perhaps Boost outperforms Perl at its own expressions and somebody is using the Boost engine in a Perl environment; then all bets are off as to whether that could have been meant as a special expression. This is just one stab: given a grammar where '!!!' meant something special at the beginning of words, you could piggyback on the established meaning like this (NOT RECOMMENDED!):
s/\\bb\b/(?:!!!(\\p{Alpha})|\\bb)/
This would be something dumb to do, but as we are dealing with code that seems unfit for its task, there are thousands of ways to fail at a task.
(\bb\s|\bb|^[a-z]\s|^[a-z]) matches a b if it's not preceded by another word character, or any lowercase letter if it's at the beginning of the string. In either case, the letter may be followed by a whitespace character. (It could match uppercase letters too if case-insensitive mode is set, and the ^ could also match the beginning of a line if multiline mode is set.)
But inside a lookbehind, that shouldn't even have compiled. In some flavors, a lookbehind can contain multiple alternatives with different, fixed lengths, but the alternation has to be at the top level in the lookbehind. That is, (?<=abc|xy|12345) will work, but (?<=(abc|xy|12345)) won't. So your regex wouldn't work even in those flavors, but Boost's docs just say the lookbehind expression has to be fixed-length.
If you really need to account for all four of the possibilities matched by that regex, I suggest you split the lookbehind into two:
(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)
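I can't run Boost here, but the same effect is easy to reproduce in another engine with a fixed-width lookbehind restriction. The sketch below uses Python's re module as a stand-in (an assumption: Boost's exact rules may differ), with a literal X appended as a hypothetical rest of the pattern:

```python
import re

ORIGINAL = r"(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))"
SPLIT = r"(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)"

def compiles(pattern):
    """Report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

original_ok = compiles(ORIGINAL)  # alternatives of width 1 and 2 are mixed
split_ok = compiles(SPLIT)        # each lookbehind has a single fixed width

# "X" stands in for whatever follows the lookbehind in the real regex.
probe = re.compile(SPLIT + "X")
blocked = [s for s in ["bX", "b X", "cX", "c X"] if not probe.search(s)]
allowed = [s for s in ["abX", "ab X", " cX"] if probe.search(s)]

print(original_ok, split_ok)
print(blocked)
print(allowed)
```

The first lookbehind covers the two one-character cases and the second covers the same cases followed by a whitespace character, so together they cover all four alternatives of the original.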