How can I replace all the HTML-encoded accents in Perl? - regex

I have the following situation:
There is a tool that gets an XSLT from a web interface and embeds the XSLT in an XML file (someone should have been fired). "Unfortunately" I work in a French-speaking country, and therefore the XSLT has a number of words with accents. When the XSLT is embedded in the XML, the tool converts all the accents to their HTML codes (Iacute, igrave, etc.).
My Perl code retrieves the XSLT from the XML and executes it against another XML file using the Xalan command-line tool. Every time there is an accent in the XSLT, the Xalan tool throws an exception.
I initially thought of using a regexp to change all the accents in the XSLT, such as:
# the & is omitted in the codes because it will be rendered in the page
$xslt =~s/Aacute;/Á/gso;
$xslt =~s/aacute;/á/gso;
$xslt =~s/Agrave;/À/gso;
$xslt =~s/Acirc;/Â/gso;
$xslt =~s/agrave;/à/gso;
but doing so means that I have to write a regexp for each of the accent codes.
My question is: is there any way to do this without writing a regexp per code? (Thinking that this is the only solution makes me want to vomit.)
By the way the tool is TeamSite, and it sucks.....
Edited: I forgot to mention that I need to have a Perl only solution, security does not let me install any type of libs they have not checked for a week or so :(

You can try something like HTML::Entities. From the POD:
use HTML::Entities;
$a = "V&aring;re norske tegn b&oslash;r &#230res";
decode_entities($a);
#encode_entities($a, "\200-\377"); ## not needed for what you are doing
In response to your edit, HTML::Entities is not in the perl core. It might still be installed on your system because it is used by a lot of other libraries. You can check by running this command:
perl -MHTML::Entities -le 'print "If this prints, then it is installed"'

For your purpose HTML::Entities is by far the best solution, but if you cannot find an existing package that fits your needs, the following approach is more efficient than multiple s/// statements:
# do this part once: in module-level code outside any function,
# in a BEGIN block, or anywhere before the first s/// statement that uses it
my %trans = (
'Aacute;' => 'Á',
'aacute;' => 'á',
'Agrave;' => 'À',
'Acirc;' => 'Â',
'agrave;' => 'à',
); # remember you can generate parts of this hash for example by map
my $re = qr/${ \(join'|', map quotemeta, keys %trans)}/;
# put this code in your functions or methods
s/($re)/$trans{$1}/g; # 'o' is almost useless here because $re is already compiled
Edit: There is no need for the e regexp modifier, as mentioned by Chas. Owens.

I don't suppose it's possible to make TeamSite leave it as utf-8/convert it to utf-8?
CGI.pm has an (undocumented) unescapeHTML function. However, since it IS undocumented (and I haven't looked through the source), I don't know if it just handles basic HTML entities (<, >, &) or more. However, I'd GUESS that it only does the basic entities.

Why should someone be fired for putting XSL, which is XML, into an XML file?

Related

Is there a way to match strings:numbers with variable positioning within the string?

We are using a simple curl to get metrics via an API. The problem is, that the output is fixed in the amount of arguments but not their position within the output.
We need to do this with a "simple" regex since the tool only accepts this.
/"name":"(.*)".*?"memory":(\d+).*?"consumer_utilisation":(\w+|\d+).*?"messages_unacknowledged":(\d+).*?"messages_ready":(\d+).*?"messages":(\d+)/s
It works fine for:
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
However if the output order is changed, then it doesn't match any more:
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
I need a position-independent way to match these strings, since I never know at which position they will appear. In total there are 9 different queue-metric groups.
The simple option is to use a regex for each key-value pair instead of one large regex.
/"name":"((?:[^\\"]|\\.)*)"/
/"memory":(\d+)/
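As a rough illustration of the per-key idea (sketched in Python rather than in the tool's own regex syntax; PATTERNS, NUMERIC_KEYS and extract are names made up for this example), each key gets its own small pattern, so the order of the keys in the line no longer matters:

```python
import re

# One small pattern per key. These names are illustrative only.
NUMERIC_KEYS = ["memory", "messages_unacknowledged", "messages_ready", "messages"]
PATTERNS = {"name": re.compile(r'"name":"((?:[^\\"]|\\.)*)"')}
for key in NUMERIC_KEYS:
    # The closing quote before the colon keeps "messages" from matching
    # inside "messages_ready" or "messages_unacknowledged".
    PATTERNS[key] = re.compile(r'"%s":(\d+)' % key)


def extract(line):
    """Return {key: captured text or None}, regardless of key order."""
    result = {}
    for key, rx in PATTERNS.items():
        m = rx.search(line)
        result[key] = m.group(1) if m else None
    return result


line = ('{"name":"queue2","consumer_utilisation":null,'
        '"messages_unacknowledged":0,"messages_ready":0,'
        '"messages":0,"memory":21944}')
print(extract(line))
```

consumer_utilisation is left out here on purpose, because its value may be null or a number and the right pattern depends on what the API really returns.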
This other option is not a regex, but might be sufficient. Instead of using regex, you could simply transform the resulting response before reading it. Since you say "We are using a simple curl" I'm guessing you're talking about the Curl command line tool. You could pipe the result into a simple Perl command.
perl -ne 'use JSON; use Text::CSV qw(csv); $hash = decode_json $_; csv (sep_char=> ";", out => *STDOUT, in => [[$hash->{name}, $hash->{memory}, $hash->{consumer_utilisation}, $hash->{messages_unacknowledged}, $hash->{messages_ready}, $hash->{messages}]]);'
This will keep the order the same, making it easier to use a regex to read out the data.
input
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
output
queue1;89048;;0;0;0
queue2;21944;;0;0;0
queue3;21944;;0;0;0
For this to work you need Perl and the packages JSON and Text::CSV installed. On my system they are present in perl, libjson-perl and libtext-csv-perl.
Note: I'm currently using ; as the separator. If a value contains the separator, that value will be surrounded by double quotes: "name":"que;ue1" => "que;ue1";89048;;0;0;0. If the value includes both a ; and a ", the " will be escaped by placing another one before it: "name":"q\"ue;ue1" => "q""ue;ue1";89048;;0;0;0

Powershell: Read a section of a file into a variable

I'm trying to create a kind of a polyglot script. It's not a true polyglot because it actually requires multiple languages to perform, although it can be "bootstrapped" by either Shell or Batch. I've got this part down no problem.
The part I'm having trouble with is a bit of embedded Powershell code, which needs to be able to load the current file into memory and extract a certain section that is written in yet another language, store it in a variable, and finally pass it into an interpreter. I have an XML-like tagging system that I'm using to mark sections of the file in a way that will hopefully not conflict with any of the other languages. The markers look like this:
lang_a_code
# <{LANGB}>
... code in language B ...
... code in language B ...
... code in language B ...
# <{/LANGB}>
lang_c_code
The #'s are comment markers, but the comment markers can be different things depending on the language of the section.
The problem I have is that I can't seem to find a way to isolate just that section of the file. I can load the entire file into memory, but I can't get the stuff between the tags out. Here is my current code:
#ECHO OFF
SETLOCAL EnableDelayedExpansion
powershell -ExecutionPolicy unrestricted -Command ^
$re = '(?m)^<{LANGB}^>(.*)^<{/LANGB}^>';^
$lang_b_code = ([IO.File]::ReadAllText(^'%0^') -replace $re,'$1');^
echo "${re}";^
echo "Contents: ${lang_b_code}";
Everything I've tried so far results in the entire file being output in the Contents rather than just the code between the markers. I've tried different methods of escaping the symbols used in the markers, but it always results in the same thing.
NOTE: The use of the ^ is required because the top-level interpreter is Batch, which hangs up on the angle brackets and other random things.
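Since there is just one block, you can use the regex
$re = '(?sm)^<{LANGB}^>(.*)^^.*^<{/LANGB}^>';^
with the -match operator, and then access the text through the $matches[1] variable that is set as a result of -match. The s flag lets . match across newlines; the m flag is needed so that the (batch-escaped) ^ anchor can match at the start of the closing marker's line rather than only at the start of the file.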
So, after the regex declaration, use
[IO.File]::ReadAllText(^'%0^') -match $re;^
echo $matches[1];
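To see the "match, then read the capture group" idea in isolation, here is a small sketch in Python rather than PowerShell. SAMPLE, SECTION_RE and extract_section are hypothetical names for this example, and the pattern is a variant that also skips the rest of the opening marker's line:

```python
import re

# Hypothetical stand-in for the polyglot file's contents.
SAMPLE = """lang_a_code
# <{LANGB}>
print("line 1")
print("line 2")
# <{/LANGB}>
lang_c_code
"""

# DOTALL lets "." cross newlines; MULTILINE lets "^" match at each line start,
# so the comment marker in front of the closing tag is left out of the capture.
SECTION_RE = re.compile(
    r"<\{LANGB\}>[^\n]*\n(.*?)^[^\n]*<\{/LANGB\}>",
    re.DOTALL | re.MULTILINE,
)


def extract_section(text):
    """Return the text between the markers, or None if they are absent."""
    m = SECTION_RE.search(text)
    return m.group(1) if m else None


print(extract_section(SAMPLE))
```

Searching and reading the group returns only the section. A replace, by contrast, leaves everything outside the match in place, which is why the whole file came back.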

If Pattern matched in 1st line i need to remove the 4th line

Friends,
I need some help with a regex pattern match and replace.
I usually use %s/findstring/replacestring/g for a match and replace within the same line.
But suppose my file looks something like this:
<tracker xid="tracker4795">
<title>MIC-DMI Change Requests</title>
<description>New tracker created </description>
<dateCreated>2010-05-03 15:18:10 EST</dateCreated>
<displayLines>1</displayLines>
<isRequired>false</isRequired>
I need to match <tracker xid.*>, skip all the lines until <displayLine.*> matches, and, if both patterns match, remove the line
<isRequired>.*
In other words, something like: if the patterns match in both the 4th and 6th lines, remove the 7th line.
Could you shed some light on how to achieve this?
You have to match the entire set of lines. For that, note that . does not match a newline character; this must be explicitly specified via \n. With that, you have multiple options:
Match the entire block, use capture groups to excise the line
The pattern is more complex, but this is the general approach:
:%s/\(<tracker xid=.*\n\%(.*\n\)\{3}<displayLines>.*\n\)<isRequired.*\n/\1/g
Match the minimal block, delete separately
This just establishes a match via :global, then uses relative addressing to remove the line.
:g/<tracker xid=.*\n\%(.*\n\)\{3}<displayLines>.*/+5delete
Caveats
Only do this if you are absolutely sure that the XML source is in a consistent, well-known format. Text editors / regular expressions are a quick and ready tool for this, but fundamentally are the wrong tool. Be aware of this, and don't blame the tool when something goes wrong. Read more here. For production-grade reliability and automation, please use an XML tool (like XSL transformations).
When you say 'something like this' it looks like what you've got there is XML. I can't say for sure, because 'something like this' covers a lot of defects.
However if it is XML, it's a really bad idea to try and parse it with a regular expression. The reason being that XML is a defined data format with a quite strict specification. If everyone sticks to that spec, then all is fine and dandy.
However, if someone is assuming you will handle their XML as XML, and you're not (because you're using a regular expression), what you will be creating is a brittle piece of code that at some point in the future will just randomly break for no apparent reason - because they stuck to the XML spec, but changed something in an entirely valid way.
So assuming that it is XML, and looks 'something like' the example below - I would suggest using Perl and XML::Twig to parse your data.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $xml;
{ local $/; $xml = <DATA> };
my $data = XML::Twig->new( pretty_print => 'indented' )->parse($xml);
foreach my $element ( $data->root->children('tracker') ) {
    my $xid = $element->att('xid');
    print $xid, "\n";
    foreach my $subelement ( $element->children ) {
        if ( $subelement->name eq 'isRequired' ) {
            # delete the 'isRequired' line
            $subelement->delete;
        }
    }
}
$data->print;
__DATA__
<xml>
<tracker xid="tracker4795">
<title>MIC-DMI Change Requests</title>
<description>New tracker created </description>
<dateCreated>2010-05-03 15:18:10 EST</dateCreated>
<displayLines>1</displayLines>
<isRequired>false</isRequired>
</tracker>
</xml>
If you know the input is in the example format (with only one open-tag per line, and all tracker tags contain a displaylines and isrequired tag), or you can force it to that format, then I think a search-and-replace is too unwieldy, and full XML parsing is "correct" but way more complicated than you need, and you should try a simpler method with the :g command:
:g#<tracker xid#/<displayLine/+1d
This searches for lines matching "<tracker xid", finds the next line after each that matches "<displayLine", and deletes the line following it (the <isRequired> line in the example).
Thus you don't need a specific number of lines in between "<tracker" and "<displayLine" so it is more robust to variances in line offsets, but it is still quite fragile to format changes.
However, I repeat the warnings from others: if the format is not easily and consistently predictable then I'd suggest parsing the file line by line in a loop, or using a real XML parser (possibly using Vim's Perl or Python integration), rather than using an :s or :g command.
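For the "line by line in a loop" option, a minimal Python sketch could look like the following. It assumes the one-tag-per-line layout from the question; drop_is_required and the sample list are names invented for this example, not part of Vim or any library:

```python
def drop_is_required(lines):
    """Drop <isRequired> lines that follow <displayLines> inside a <tracker>."""
    out = []
    in_tracker = False
    seen_display = False
    for line in lines:
        text = line.strip()
        if text.startswith("<tracker xid="):
            in_tracker, seen_display = True, False
        elif text.startswith("</tracker"):
            in_tracker, seen_display = False, False
        elif in_tracker and text.startswith("<displayLines"):
            seen_display = True
        elif in_tracker and seen_display and text.startswith("<isRequired"):
            continue  # skip this line instead of copying it to the output
        out.append(line)
    return out


sample = [
    '<tracker xid="tracker4795">',
    "<title>MIC-DMI Change Requests</title>",
    "<description>New tracker created </description>",
    "<dateCreated>2010-05-03 15:18:10 EST</dateCreated>",
    "<displayLines>1</displayLines>",
    "<isRequired>false</isRequired>",
    "</tracker>",
]
cleaned = drop_is_required(sample)
print("\n".join(cleaned))
```

Unlike a fixed line offset, this does not care how many lines sit between the two tags, but it is still tied to the line layout and is no substitute for a real XML parser.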

When and why did the output of qr() change?

The output of perl's qr has changed, apparently sometime between versions 5.10.1 and 5.14.2, and the change is not documented--at least not fully.
To demonstrate the change, execute the following one-liner on each version:
perl -e 'print qr(foo)is."\n"'
Output from perl 5.10.1-17squeeze6 (Debian squeeze):
(?-xism:foo)
Output from perl 5.14.2-21+deb7u1 (Debian wheezy):
(?^:foo)
The perl documentation (perldoc perlop) says:
$rex = qr/my.STRING/is;
print $rex; # prints (?si-xm:my.STRING)
s/$rex/foo/;
which appears to no longer be true:
$ perl -e 'print qr/my.STRING/is."\n"'
(?^si:my.STRING)
I would like to know when this change occurred (which version of Perl, or supporting library or whatever).
Some background, in case it's relevant:
This change has caused a bunch of unit tests to fail. I need to decide if I should simply update the unit tests to reflect the new format, or make the tests dynamic enough to support both formats, etc. To make an informed decision, I would like to understand why the change took place. Knowing when and where it took place seems like the best place to start in that investigation.
It's documented in perl5140delta:
Regular Expressions
(?^...) construct signifies default modifiers
[...] Stringification of regular expressions now uses this notation. [...]
This change is likely to break code that compares stringified regular expressions with fixed strings containing ?-xism.
The function regexp_pattern can be used to parse the modifiers for normalisation purposes.
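To make the normalisation idea concrete, here is a rough sketch in Python rather than Perl. It reduces both stringified forms to a pair of "flags switched on" and "pattern", so a comparison no longer depends on the notation. The normalise helper and its two patterns are invented for this example, and it assumes only lowercase flag letters like those in the examples above:

```python
import re

# New style: (?^flags:pattern)   Old style: (?on-off:pattern)
_NEW = re.compile(r"^\(\?\^([a-z]*):(.*)\)$", re.DOTALL)
_OLD = re.compile(r"^\(\?([a-z]*)(?:-[a-z]*)?:(.*)\)$", re.DOTALL)


def normalise(stringified):
    """Return (frozenset of enabled flags, pattern) for either notation."""
    for rx in (_NEW, _OLD):
        m = rx.match(stringified)
        if m:
            return frozenset(m.group(1)), m.group(2)
    raise ValueError("unrecognised stringified regex: %r" % stringified)


print(normalise("(?-xism:foo)") == normalise("(?^:foo)"))
print(normalise("(?si-xm:my.STRING)") == normalise("(?^si:my.STRING)"))
```

In Perl itself, the regexp_pattern function mentioned above serves this purpose.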
Part of the reason this was added, was that regular expressions were getting quite a few new modifiers.
Your example would actually produce something like this if that change didn't happen:
(?d-xismpaul:foo)
That also doesn't really express the modifiers in place:
d/u/l: these can only be added to a regex, not subtracted like i. They are also mutually exclusive.
a/aa: there are actually two levels for this modifier.
While work was underway to add these modifiers, it was determined that this would break quite a few tests in CPAN modules.
Seeing as the tests were going to break anyway, it was agreed that there should be a way of specifying "just use the defaults" ((?^:…)).
That way, the tests wouldn't have to be updated every time a new modifier was added.
To receive the stringified form of a regexp you can use Regexp::Parser and its qr method. Using this module you can not only test the representation of a regexp, but also walk a tree.

grep replacement with extensive regular expression implementation

I have been using grepWin for general searching of files, and wingrep when I want to do replacements or what-have-you.
GrepWin has an extensive implementation of regular expressions, but it doesn't do replacements (as mentioned above).
Wingrep does replacements, but its regular expression support is severely limited.
Does anyone know of any (preferably free) grep tools for Windows that do replacements AND have a reasonable implementation of regular expressions?
Thanks in advance.
I think perl at the command line is the answer you are looking for. Widely portable, powerful regex support.
Let's say that you have the following file:
foo
bar
baz
quux
you can use
perl -pne 's/quux/splat!/' -i /tmp/foo
to produce
foo
bar
baz
splat!
The magic is in Perl's command line switches:
-e: execute the next argument as a perl command.
-n: execute the command on every line
-p: print the results of the command, without issuing an explicit
'print' statement.
-i: make substitutions in place. overwrite the document with the
output of your command... use with caution.
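To spell out what those switches do together, here is a rough Python equivalent of the loop they set up: read each line, apply the substitution, write the result back over the file. It is only a sketch; replace_in_place is a made-up helper name and the temporary file stands in for /tmp/foo:

```python
import re
import tempfile
from pathlib import Path


def replace_in_place(path, pattern, replacement):
    """Apply a regex substitution to every line and overwrite the file."""
    p = Path(path)
    lines = p.read_text().splitlines(keepends=True)  # the per-line loop (-n/-p)
    new_text = "".join(re.sub(pattern, replacement, line) for line in lines)
    p.write_text(new_text)  # overwrite the original, like -i


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "foo"
    sample.write_text("foo\nbar\nbaz\nquux\n")
    replace_in_place(sample, r"quux", "splat!")
    result = sample.read_text()

print(result)
```

As with -i, nothing here keeps a backup of the original file.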
I use Cygwin quite a lot for this sort of task.
Unfortunately it has the world's most unintuitive installer, but once it's installed correctly it's very usable... well apart from a few minor issues with copy and paste and the odd issue with line-endings.
The good thing is that all the tools work like on a real GNU system, so if you're already familiar with Linux or similar, you don't have to learn anything new (apart from how to use that crazy installer).
Overall I think the advantages make up for the few usability issues.
If you are on Windows, you can use VBScript (requires no downloads). It comes with regex support. For example, to change "one" to "ONE":
Set objFS=CreateObject("Scripting.FileSystemObject")
Set WshShell = WScript.CreateObject("WScript.Shell")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
strFileContents = objFile.ReadAll
Set objRE = New RegExp
objRE.Global = True
objRE.IgnoreCase = False
objRE.Pattern = "one"
strFileContents = objRE.Replace(strFileContents,"ONE") 'simple replacement
WScript.Echo strFileContents
output
C:\test>type file
one
two one two
three
C:\test>cscript //nologo test.vbs file
ONE
two ONE two
three
You can read the VBScript documentation to learn more about using regex.