When and why did the output of qr() change? - regex

The output of perl's qr has changed, apparently sometime between versions 5.10.1 and 5.14.2, and the change is not documented--at least not fully.
To demonstrate the change, execute the following one-liner on each version:
perl -e 'print qr(foo)."\n"'
Output from perl 5.10.1-17squeeze6 (Debian squeeze):
(?-xism:foo)
Output from perl 5.14.2-21+deb7u1 (Debian wheezy):
(?^:foo)
The perl documentation (perldoc perlop) says:
$rex = qr/my.STRING/is;
print $rex; # prints (?si-xm:my.STRING)
s/$rex/foo/;
which appears to no longer be true:
$ perl -e 'print qr/my.STRING/is."\n"'
(?^si:my.STRING)
I would like to know when this change occurred (which version of Perl, or supporting library or whatever).
Some background, in case it's relevant:
This change has caused a bunch of unit tests to fail. I need to decide if I should simply update the unit tests to reflect the new format, or make the tests dynamic enough to support both formats, etc. To make an informed decision, I would like to understand why the change took place. Knowing when and where it took place seems like the best place to start in that investigation.

It's documented in perl5140delta:
Regular Expressions
(?^...) construct signifies default modifiers
[...] Stringification of regular expressions now uses this notation. [...]
This change is likely to break code that compares stringified regular expressions with fixed strings containing ?-xism.
The function regexp_pattern can be used to parse the modifiers for normalisation purposes.
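For test code, re::regexp_pattern is handy: in list context it splits a compiled regexp into its pattern and its modifiers, so tests can compare those instead of the version-dependent stringification. A minimal sketch (the function is available on both of the Perl versions mentioned above):

use re qw(regexp_pattern);

my ($pat, $mods) = regexp_pattern(qr/my.STRING/is);
print "$pat\n";     # my.STRING
print "$mods\n";    # the modifier letters in effect (here s and i)

Comparing the ($pat, $mods) pair keeps a test stable even if the stringified wrapper changes again.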

Part of the reason this was added was that regular expressions were gaining quite a few new modifiers.
Your example would actually produce something like this if that change didn't happen:
(?d-xismpaul:foo)
Even that notation doesn't really express the modifiers properly:
d/u/l can only be added to a regex, not subtracted the way i can.
They are also mutually exclusive.
a/aa: there are actually two levels to this modifier.
While work was underway adding these modifiers, it became clear that the change would break quite a few tests in CPAN modules.
Since those tests were going to break anyway, it was agreed that there should be a way of saying "just use the defaults" ((?^:…)).
That way, the tests wouldn't have to be updated every time a new modifier was added.

To work with the stringified form of a regexp you can use Regexp::Parser and its qr method. With this module you can not only test the representation of a regexp, but also walk its parse tree.
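A minimal sketch of that idea (method names taken from the module's CPAN documentation; treat this as an untested sketch):

use Regexp::Parser;

my $parser = Regexp::Parser->new;
# regex() parses the pattern and returns false on failure
$parser->regex('(?i)foo|bar') or die "could not parse pattern";
my $qr = $parser->qr;    # the parsed pattern, recompiled as a qr// object
print "$qr\n";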

Finding and modifying function definitions (C++) via bash-script

Currently I am working on a fairly large project. In order to increase the quality of our code, we decided to enforce the treatment of return values (Error Codes) for every function. GCC supports a warning concerning the return value of a function; however, the function definition has to be annotated with the following attribute.
static __attribute__((warn_unused_result)) ErrorCode test() { /* code goes here */ }
I want to implement a bash script that parses the entire source code and issues a warning in case the
__attribute__((warn_unused_result))
is missing.
Note that all functions that require this kind of modification return a type called ErrorCode.
Do you think this is possible via a bash script?
Maybe you can use sed with regular expressions. The following worked for me on a couple of test files I tried:
sed -r "s/ErrorCode\s+\w+\s*\(.*\)\s*\{/__attribute__((warn_unused_result)) &/g" test.cpp
If you're not familiar with regex, the pattern basically translates into:
ErrorCode, some whitespace, some alphanumerics (function name), maybe some whitespace, open parenthesis, anything (arguments), close parenthesis, maybe some whitespace, open curly brace.
If this pattern is found, it is prefixed by __attribute__((warn_unused_result)). Note that this only works if you are putting the open curly brace always in the same line as the arguments and you don't have line breaks in your function declarations.
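If you only want a warning rather than an automatic rewrite, the same idea works as a Perl one-liner, under the same assumptions (the brace is on the same line and the declaration has no line breaks; the pattern is a sketch, not a C++ parser):

perl -ne 'print "$ARGV:$.: missing warn_unused_result\n"
    if /\bErrorCode\s+\w+\s*\([^)]*\)\s*\{/ && !/warn_unused_result/;
    close ARGV if eof' *.cpp

Closing ARGV at eof resets $., so the reported line numbers are per file.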
An easy way I could imagine is via ctags. You create a tag file over all your source code, and then parse the tags file. However, I'm not quite sure about the format of the tags file. The variant I'm using here (Exuberant Ctags 5.8) seems to put an "f" in the fourth column, if the tag represents a function. So in this case I would use awk to filter all tags that represent functions, and then grep to throw away all lines without __attribute__((warn_unused_result)).
So, in a nutshell, first you do
$ ctags **/*.c
This creates a file called "tags" in the current directory. The command might also be ctags-exuberant, depending on your variant. The **/*.c is a glob pattern that might work in your shell - if it doesn't, you have to supply your source files in another way (look at the ctags options).
Then you filter the functions:
$ awk -F '\t' '$4 == "f"' tags | grep -v "__attribute__((warn_unused_result))"
No, it is not possible in the general case. The C++ grammar is the most complex of all the languages I know of, and C++ is not parsable via regular expressions in the general case. You might succeed if you limit yourself to a very narrow set of uses, but I am not sure how feasible it is in your case.
I also do not think the exercise is worth the effort, since sometimes ignoring the result of a function is an OK thing to do.

Ignore multiline comments in git diff

I'm trying to find the significant differences in C/C++ source files, i.e. places where only actual source code changes. I know you can use git diff -G<regex>, but it seems very limited in the kind of regexes that can be run. For example, it doesn't seem to offer a way to ignore multiline comments in C/C++.
Is there any way in git or preferably libgit2 to ignore comments (including multiline), whitespaces, etc. before a diff is run? Or a way of determining if a line from the diff output is a comment or not?
Use git diff -w to ignore whitespace differences.
You cannot ignore multiline comments, because git is a versioning tool, not a language-dependent interpreter. It doesn't know your code is C++. It does not parse files for semantics, so it cannot tell what is a comment and what isn't. In particular, it relies on diff (or a configured difftool) to compare text files, and it expects a line-by-line comparison.
I agree with andrew-c that what you are really asking is to compare the two pieces of code without comments. More specifically, you are asking to compare the lines of code after all multiline comments have been turned into empty lines. You keep the blank lines there so you have the correct line numbers to reference against a normal copy.
So you could manually convert the two code states to blank out multiline comments... or you could build your own diff wrapper that does the stripping for you. The latter is not likely to be worth the effort, though.
You can achieve this using git attributes and diff filters, as described in Viewing git filters output when using meld as a diff tool, to call a sed script; that, however, gets pretty complex on its own if you want it to handle all the corner cases, like comment delimiters inside string literals.
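As a rough sketch of that idea (the script name and driver name are mine, not a standard, and it deliberately ignores the string-literal corner case): a textconv driver that blanks out /* ... */ comments while preserving the newlines inside them, so diff line numbers still line up.

#!/usr/bin/env perl
# strip-comments: print a C/C++ file with /* ... */ comments blanked out,
# keeping the newlines inside each comment so line numbers stay stable
local $/;                                   # slurp the whole file
my $src = <>;
$src =~ s{/\*.*?\*/}{ (my $c = $&) =~ tr/\n//cd; $c }ges;
print $src;

Wire it up once per repository (assuming strip-comments is executable and on your PATH):

git config diff.nocomment.textconv strip-comments
echo '*.cpp diff=nocomment' >> .gitattributes

After that, git diff shows the files as the filter sees them, i.e. without multiline comments.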

Regular expression incremental parsing

Are there languages or tools that support the parsing of regexes on a character-by-character basis?
I think this may be equivalent to "regexes on streams" which is something that seems to be one of the features of the upcoming Perl version 6.
Basically I want to do this because I'm building a tool that translates a terminal stream over a pseudo-terminal, and it occurred to me that the most flexible design attainable would be to allow regex-replace expressions to be specified.
The use case is that I want my mouse scroll events to be passed to a naive program such as the less pager. That means my tool (which spawns less over a PTY) will do something like issue the code \x1b[?1000h, which switches on mouse reporting, and then translate every mouse-wheel escape code received thereafter, such as \x1b[M!! (the last several chars encode the mouse position within the terminal and should be stripped), into the \x1b[A Up-arrow code.
As you can see, being able to specify a regex that consumes the stdin terminal-reading stream and generates the translated stream to send to the slave pty would be ideal.
Do I need to wait for Perl 6 to be able to achieve this? There must be particular reasons why regex engines generally require the whole string to be available.
It's pretty obvious I don't need the full-blown power of regexes here. I can speculate, for instance, that supporting backtracking is what makes stream-parsing regexes impossible.
So, since I don't need backtracking, maybe there is some sort of lightweight regex engine out there that provides a stream API. Taking advantage of an existing parsing system (if a suitable one exists) seems smarter than building something arbitrary.
Looks like s2p is an example of something that I can use.
In particular, it might let me set $| so as not to do line-buffering.
Actually, I don't think this will work. It seems to be built around lines and uses the s operator to run regexes.
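For the concrete use case above, though, a full streaming regex engine may not be needed: if the translator buffers input and holds back only a possibly-incomplete trailing escape sequence, ordinary Perl regexes can do the rewriting chunk by chunk. A rough sketch, not production code (the wheel-event byte values assume the X10 mouse encoding and may differ per terminal):

#!/usr/bin/env perl
# Incrementally translate a terminal stream: rewrite mouse-wheel reports
# into arrow keys, strip other mouse reports, pass everything else through.
use strict;
use warnings;

$| = 1;    # unbuffered output

my $buf = '';
while (sysread STDIN, my $chunk, 1024) {
    $buf .= $chunk;
    # complete mouse reports are ESC [ M followed by three bytes
    $buf =~ s{\x1b\[M(.)(.)(.)}{
        ord($1) == 0x60 ? "\x1b[A"      # wheel up   -> Up arrow
      : ord($1) == 0x61 ? "\x1b[B"      # wheel down -> Down arrow
      : ""                              # other mouse reports: strip
    }ges;
    # hold back a possibly-incomplete escape sequence at the end of the buffer
    if ($buf =~ /\x1b(?:\[(?:M.{0,2})?)?\z/s) {
        print substr($buf, 0, $-[0]);
        $buf = substr($buf, $-[0]);
    }
    else {
        print $buf;
        $buf = '';
    }
}
print $buf;    # flush whatever is left at EOF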

Global substitution for latex commands in vim

I am writing a long document and I am frequently formatting some terms in italics. After some time I realized that maybe that is not what I want, so I would like to remove all the LaTeX commands that format text in italics.
Example:
\textit{Vim} is undoubtedly one of the best editors ever made. \textit{LaTeX} is an extremely powerful, intelligent typesetter. \textbd{Vim-LaTeX} aims at bringing together the best of both these worlds
How can I run a substitution command that recognizes all the instances of \textit{whatever} and changes them to just whatever without affecting different commands such as \textbd{Vim-LaTeX} in this example?
EDIT: As technically the answer that helps is the one from Igor, I will mark that one as correct. Nevertheless, Konrad's answer should be taken into account, as it shows the proper LaTeX strategy to follow.
You shouldn’t use formatting commands at all in your text.
LaTeX is built around the idea of semantic markup. So instead of saying “this text should be italic” you should mark up the text using its function. For instance:
\product{Vim} is undoubtedly one of the best editors ever made. \product{LaTeX}
is an extremely powerful, intelligent typesetter. \product{Vim-LaTeX} aims at
bringing together the best of both these worlds
… and then, in your preamble, a package, or a document class, you (re-)define a macro \product to set the formatting you want. That way, you can adapt the macro whenever you deem necessary without having to change the code.
Or, if you want to remove the formatting completely, just make the macro display its bare argument:
\newcommand*\product[1]{#1}
Use this substitution command:
%s/\\textit{\([^}]*\)}/\1/g
If \textit can span multiple lines:
%! perl -e 'local $/; $_=<>; s/\\textit{([^}]*)}/$1/g; print;'
And you can do this without Perl, too:
%s/\\textit{\(\_.\{-}\)}/\1/g
Here:
\_. -- any symbol including a newline character
\{-} -- the non-greedy version of *.
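Note that all of these assume the argument of \textit contains no nested braces (e.g. \textit{a {\small b}} would stop at the first closing brace). If nesting matters, a recursive pattern helps; a sketch of the same filtering idea, requiring Perl 5.10+, where (?1) recurses into the brace-matching group and substr strips the outer braces:

%!perl -0777 -pe 's/\\textit(\{(?:[^{}]++|(?1))*\})/substr($1, 1, -1)/ge'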

How can I replace all the HTML-encoded accents in Perl?

I have the following situation:
There is a tool that gets an XSLT from a web interface and embeds the XSLT in an XML file (someone should have been fired). "Unfortunately" I work in a French-speaking country, and therefore the XSLT has a number of words with accents. When the XSLT is embedded in the XML, the tool converts all the accents to their HTML entity codes (Iacute, igrave, etc.).
My Perl code retrieves the XSLT from the XML and executes it against another XML file using the Xalan command-line tool. Every time there is an accent in the XSLT, the Xalan tool throws an exception.
I initially thought to do a regexp to change all the accents in the XSLT, such as:
# the & is omitted in the codes because it would be rendered in the page
$xslt =~s/Aacute;/Á/gso;
$xslt =~s/aacute;/á/gso;
$xslt =~s/Agrave;/À/gso;
$xslt =~s/Acirc;/Â/gso;
$xslt =~s/agrave;/à/gso;
but doing so means that I have to write a regexp for each of the accent codes...
My question is: is there any way to do this without writing a regexp per code? (Thinking that is the only solution makes me want to vomit.)
By the way the tool is TeamSite, and it sucks.....
Edited: I forgot to mention that I need a Perl-only solution; security does not let me install any type of libs they have not checked for a week or so :(
You can try something like HTML::Entities. From the POD:
use HTML::Entities;
$a = "Våre norske tegn bør &#230res";
decode_entities($a);
#encode_entities($a, "\200-\377"); ## not needed for what you are doing
In response to your edit: HTML::Entities is not in the Perl core. It might still be installed on your system, because it is used by a lot of other libraries. You can check by running this command:
perl -MHTML::Entities -le 'print "If this prints, then it is installed"'
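Applied to the situation above (a sketch; the sample string is mine, and $xslt stands for the stylesheet text pulled out of the XML wrapper), decode_entities called in void context modifies its argument in place:

use HTML::Entities qw(decode_entities);

my $xslt = 'caract&egrave;res accentu&eacute;s';
decode_entities($xslt);    # $xslt is now "caractères accentués"

One caveat: this decodes every entity, including &lt; and &amp;, which you probably want to keep encoded inside an XSLT, so you may need to re-encode the XML-significant characters afterwards.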
For your purpose HTML::Entities is by far the best solution, but if you cannot find an existing package that fits your needs, the following approach is more effective than multiple s/// statements:
# do this part in module code outside any function (it runs once at load time),
# or place it in a BEGIN block, or run it once before the first s/// that uses it
my %trans = (
'Aacute;' => 'Á',
'aacute;' => 'á',
'Agrave;' => 'À',
'Acirc;' => 'Â',
'agrave;' => 'à',
); # remember you can generate parts of this hash, for example with map
my $re = qr/${ \(join '|', map quotemeta, keys %trans) }/;
# place this code in your functions or methods
s/($re)/$trans{$1}/g; # /o would be almost useless here because $re has already been compiled
Edit: there is no need for the /e regexp modifier, as mentioned by Chas. Owens.
I don't suppose it's possible to make TeamSite leave it as utf-8/convert it to utf-8?
CGI.pm has an (undocumented) unescapeHTML function. However, since it IS undocumented (and I haven't looked through the source), I don't know if it just handles basic HTML entities (<, >, &) or more. However, I'd GUESS that it only does the basic entities.
Why should someone be fired for putting XSL, which is XML, into an XML file?