Edit all files in directory tree with regular expression on Windows - regex

I am looking for a program that can edit all files in a directory tree, the way Perl can on Unix systems. The tree contains XML files and subfolders.
The regex should delete all the content between the <loot> and </loot> tags.
For example, this file:
<?xml version="1.0" encoding="UTF-8"?>
<monster name="Dragon"/>
<health="10000"/>
<immunities>
<immunity fire="1"/>
</immunities>
<loot>
<item id="1"/>
<item id="3"/>
<inside>
<item id="6"/>
</inside>
</item>
</loot>
should look like this after the edit:
<?xml version="1.0" encoding="UTF-8"?>
<monster name="Dragon"/>
<health="10000"/>
<immunities>
<immunity fire="1"/>
</immunities>
<loot>
</loot>

I would shy away from anything regex based - XML simply doesn't work with regular expressions.
But fortunately, Perl for Windows is readily available. And better yet, if you go with Strawberry perl, it comes bundled with both XML::Twig and XML::LibXML.
At which point the problem becomes inanely simple:
#!/usr/bin/perl
use warnings;
use strict;
use File::Find::Rule;
use XML::Twig;

# Handler for each <loot> element: remove its children, keep the element.
sub delete_loot {
    my ( $twig, $loot ) = @_;
    foreach my $loot_entry ( $loot->children ) {
        $loot_entry->delete;
    }
    $twig->flush;
}

my $twig = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => {
        'loot'  => \&delete_loot,
        '_all_' => sub { $_->flush },
    },
);

# Walk the tree and rewrite every matching file in place.
foreach my $file ( File::Find::Rule->file()->name('*.xml.txt')->in('C:\tmp') ) {
    print "Processing $file\n";
    $twig->parsefile_inplace($file);
}
Of course, this also assumes that your XML is, in fact, XML - which your example isn't. If that example is actually correct, then you should really hit whoever wrote it around the head with a rolled up copy of the XML Spec whilst chanting 'don't make fake XML'.
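Since Strawberry Perl also bundles XML::LibXML, the same idea can be sketched with that module. This is only an illustration on an in-memory string: the document below is a hypothetical, well-formed version of the monster file (the attribute name `now` and the single `<monster>` root are assumptions, because the sample in the question is not valid XML).

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical well-formed version of the monster file from the question.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<monster name="Dragon">
  <health now="10000"/>
  <loot>
    <item id="1"/>
    <item id="3">
      <inside><item id="6"/></inside>
    </item>
  </loot>
</monster>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Empty every <loot> element, but keep the element itself.
$_->removeChildNodes for $doc->findnodes('//loot');

my $result = $doc->toString;
print $result;
```

To run this over a directory tree, the File::Find::Rule loop above should be reusable as it is, loading each file with load_xml( location => $file ) and writing it back with $doc->toFile($file).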

Related

Unable to figure out how to replace a tag in xml

Need help to figure out correct regex for replacing xml tag with contents of a file.
Tried basic things like escaping special characters but no luck. Open to using something else other than sed.
config.txt
<localReplications/>
replace-with-config.txt
<localReplications>
<localReplication>
<enabled>true</enabled>
<cronExp>0 0 /5 * * ?</cronExp>
<syncDeletes>true</syncDeletes>
<syncProperties>true</syncProperties>
<repoKey>some-repo-key</repoKey>
<url>https://foo.bar/random</url>
<socketTimeoutMillis>15000</socketTimeoutMillis>
<username>foo</username>
<password>bar</password>
<enableEventReplication>true</enableEventReplication>
<syncStatistics>false</syncStatistics>
</localReplication>
</localReplications>
<localReplications/> tag is part of really complicated xml file. I expect <localReplications/> to be replaced with contents in replace-with-config.txt
use XML::LibXML qw();

my $config = XML::LibXML->load_xml(location => 'config.txt');
my $replace = (
    XML::LibXML
        ->load_xml(location => 'replace-with-config.txt')
        ->findnodes('//localReplications')
)[0];

for my $local_replications ( $config->findnodes('//localReplications') ) {
    # $local_replications->replaceNode($replace);
    # this fails with HIERARCHY_REQUEST_ERR,
    # so do it in two steps instead
    $local_replications->addSibling($replace);
    $local_replications->unbindNode;
}
print $config->toString;

How to print out matched string with perlre?

Perlre (Perl regular expressions) is used for searching and replacing in a complex XML structure, e.g.
perl -0777 -pe 's/<ac:structured-macro ac:macro-id="[a-z0-9\-]+" ac:name="gadget" ac:schema-version="1"><ac:parameter ac:name="preferences">.*?selectedIssueKey=([A-Z\-0-9]+).*?(<\/ac:parameter>)<ac:parameter ac:name="url">https:\/\/rcrs.rbinternational.corp\/issue\/rest\/gadgets\/1.0\/g\/com.pyxis.greenhopper.jira:greenhopper-card-view-gadget\/gadgets\/greenhopper-card-view.xml<\/ac:parameter>.*?(<\/ac:structured-macro>)/<ac:structured-macro ac:name="jira" ac:schema-version="1"><ac:parameter ac:name="server">RCRS Issue Tracking<\/ac:parameter><ac:parameter ac:name="columns">key,type,assignee,status,nwu model developer,nwu model reviewer,nwu model owner,nwu head rr\/b2\/cro,ho validation owner,ho pi\/micro country manager,ho responsible signee<\/ac:parameter><ac:parameter ac:name="maximumIssues">20<\/ac:parameter><ac:parameter ac:name="jqlQuery">key = \1 <\/ac:parameter><ac:parameter ac:name="serverId">d64129aa-b1e8-3584-8953-2bd89c3e515c<\/ac:parameter><\/ac:structured-macro>/igs' macro
Currently the search pattern does not match as expected. Apparently the pattern spans more of the string than expected. Is there a way to print out each matched string for debugging purposes?
You can turn on debugging for regex with use re 'debug';. However, that would be the wrong approach to take. Your problem here isn't that your regex is wrong; it's that regex is fundamentally the wrong tool for XML parsing. (And leaving that aside, your line is just too long to be sensible to use inline like that!)
Given your example - it looks like you're trying to extract a single value (selectedIssueKey) and insert it into a new blob of XML.
This is done much more easily with a parser, such as XML::Twig. I can't give you a precise example, because I would need to see your XML structure (or at least a subset without the wildcards).
But something like this can be used for extracting a value from some XML:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
my $selectedIssueKey = $twig -> findnodes ( '//ac:parameter/pref', 0) -> att('selectedIssueKey');
print $selectedIssueKey;
Extracts the value of an attribute 'selectedIssueKey' from:
<ac:structured-macro ac:macro-id="test" ac:name="gadget" ac:schema-version="1">
<ac:parameter ac:name="preferences">
<pref selectedIssueKey="anothertest" />
</ac:parameter>
<ac:parameter ac:name="url">https://rcrs.rbinternational.corp/issue/rest/gadgets/1.0/g/com.pyxis.greenhopper.jira:greenhopper-card-view-gadget/gadgets/greenhopper-card-view.xml</ac:parameter>
</ac:structured-macro>
XML::Twig also lets you cut and paste, so you could do something like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> new ( 'pretty_print' => 'indented_a' ) -> parsefile ( 'sample.xml' );
my $selectedIssueKey = $twig -> findnodes ( '//ac:parameter/pref', 0) -> att('selectedIssueKey');
print "Found key of: $selectedIssueKey\n";
my $ac_structured_macro = $twig -> findnodes ( '//ac:structured-macro',0 );
my $new_macro = $twig -> root -> insert_new_elt( 'last_child', 'ac:structured-macro', { "ac:name" => "jira", "ac:schema-version"=> "1" } );
$new_macro -> insert_new_elt('last_child', 'ac:parameter', { 'ac:name' => 'server' }, "RCRS ISSUE Tracking" );
$new_macro -> insert_new_elt('last_child', 'ac:parameter', { 'ac:name' => 'columns' }, "key,type,assignee,status,etc" );
$new_macro -> insert_new_elt('last_child', 'ac:parameter', { 'ac:name' => 'maximumIssues' }, "20" );
$new_macro -> insert_new_elt('last_child', 'ac:parameter', { 'ac:name' => 'jqlQuery' }, "key = $selectedIssueKey" );
$new_macro -> insert_new_elt('last_child', 'ac:parameter', { 'ac:name' => 'serverId' }, "d64129aa-b1e8-3584-8953-2bd89c3e515c" );
$ac_structured_macro -> delete;
$twig -> print;
(It's probably easier to use a whole XML snippet for this though, and just replace the bits you want).
use re 'debug'
To debug your regex, read more about it here http://perldoc.perl.org/re.html#%27debug%27-mode
Use a CPAN module for this. May save you some headache.
But if you still want to use regex for the job, I'd suggest you expand the regex in a script and step through it using the debugger like mentioned above.
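For the narrower question of seeing what each match actually spanned, one option is to run the pattern in a while loop with the /g flag and print the matched text yourself. The sketch below uses a deliberately short, hypothetical pattern and input in place of the long one-liner from the question; @- and @+ are Perl's built-in match offset arrays.

```perl
use strict;
use warnings;

# Hypothetical input; in the real case this would be the slurped file.
my $text = '<m id="a">one</m><x/><m id="b">two</m>';

# A short stand-in for the long pattern in the question.
my $pattern = qr{<m id="[a-z]+">.*?</m>}s;

my @matches;
while ( $text =~ /$pattern/g ) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my ( $start, $end ) = ( $-[0], $+[0] );
    my $matched = substr( $text, $start, $end - $start );
    push @matches, $matched;
    print "MATCH at offset $start: $matched\n";
}
```

Printing the offsets alongside the text makes it easier to spot a non-greedy .*? that still runs across several macros before it finds its closing tag.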

Extract String between { } in Perl

I have all of the below stored in $data.
'Berry-Berry Belgian Waffles' => {
'calories' => '900',
'price' => '$8.95',
'description' => 'Light Belgian waffles covered with an assortment of fresh berries and whipped cream'
},
I need to extract the contents in between the '{' and '}' using regular expression. So, the result should be as follows.
'calories' => '900',
'price' => '$8.95',
'description' => 'Light Belgian waffles covered with an assortment of fresh berries and whipped cream'
How do I achieve this using perl script?
This is the script I have so far, it reads from an xml file whether it's on the web or a local file.
use XML::Simple;
use LWP;
use Data::Dumper;
#request path
print "Enter path\n";
my $input = <STDIN>;
my $data;
chomp $input;
print "Path : $input\n";
if ($input =~ /http/)
{
print "This is a webpage\n";
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => $input );
my $res = $ua->request( $req );
print Dumper (XML::Simple->new()->XMLin( $res->content ));
}
else
{
print "This is a local path\n";
$xml = new XML::Simple;
$data = $xml ->XMLin($input);
print Dumper($data);
}
print "Type in keyword to search: \n";
my $inputsearch = <STDIN>;
chomp $inputsearch;
print "You typed --> $inputsearch\n";
Dumper($data) =~ m/$inputsearch/;
$after = "$'";
$result = $after =~ /{...}/;
print $result;
OK, seriously. Please don't use XML::Simple. Even XML::Simple says:
The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces.
I'm going to make a guess at how your XML looks, and give you an idea how to extract information from it. I'll update if you can give a better example of the XML.
<root>
<item name="Berry-Berry Belgian Waffles">
<calories>900</calories>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
</item>
</root>
And you can process it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new( 'pretty_print' => 'indented' );
$twig->parse( \*DATA );
foreach my $item ( $twig -> get_xpath ( '//item' ) ) {
print "Name: ", $item -> att('name'),"\n";
foreach my $element ( $item -> children ) {
print $element -> tag,": ", $element -> trimmed_text,"\n";
}
}
__DATA__
<root>
<item name="Berry-Berry Belgian Waffles">
<calories>900</calories>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
</item>
</root>
With XML::Twig you can access "attributes" via att, the element name via tag and the content via text or trimmed_text.
So the above will print:
Name: Berry-Berry Belgian Waffles
calories: 900
price: $8.95
description: Light Belgian waffles covered with an assortment of fresh berries and whipped cream

Generic solution for removing xml declaration using perl

Hi, I want to remove the declaration from my XML file. The problem is that the declaration is sometimes on the same line as the root element.
XML looks as follows
Case1:
<?xml version="1.0" encoding="UTF-8"?> <document> This is a document root
<child>----</child>
</document>`
Case 2:
<?xml version="1.0" encoding="UTF-8"?>
<document> This is a document root
<child>----</child>
</document>`
Function should also work for the case when root node is in next line.
My function works only for case 2..
sub getXMLData {
    my ($xml) = @_;
    my @data = ();
    open(FILE,"<$xml");
    while(<FILE>) {
        chomp;
        if(/\<\?xml\sversion/) {next;}
        push(@data, $_);
    }
    close(FILE);
    return join("\n",@data);
}
*** Please note that encoding is not constant always.
OK, so the problem here is - you're trying to parse XML line based, and that DOESN'T WORK. You should avoid doing it, because it makes brittle code, which will one day break - as you've noted - thanks to perfectly valid changes to the source XML. Both your documents are semantically identical, so the fact your code handles one and not the other is an example of exactly why doing XML this way is a bad idea.
More importantly though - why are you trying to remove the XML declaration from your XML? What are you trying to accomplish?
Generically reformatting XML can be done like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
pretty_print => 'indented',
);
$twig->parsefile('your_xml_file');
$twig->print;
This will parse your XML and reformat it in one of the valid ways XML may be formatted. However I would strongly urge you not to just discard your XML declaration, and instead carry on with something like XML::Twig to process it. (Open a new question with what you're trying to accomplish, and I'll happily give you a solution that doesn't trip up with different valid formats of XML).
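If the declaration really does have to go, a parser can still do it without caring where the line breaks are. A minimal sketch, assuming the input is well-formed and small enough to hold in memory: parse it with XML::Twig and serialise only the root element with sprint. The string below is "Case 1" from the question.

```perl
use strict;
use warnings;
use XML::Twig;

# "Case 1" from the question: declaration and root element on one line.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?> <document> This is a document root
<child>----</child>
</document>
XML

my $twig = XML::Twig->new;
$twig->parse($xml);

# Serialising only the root element leaves the declaration out.
my $body = $twig->root->sprint;
print $body, "\n";
```

The same code handles "Case 2" unchanged, because the parser does not treat the line break after the declaration as significant.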
When it comes to merging XML documents, XML::Twig can do this too - and still check and validate your XML as it goes.
So you might do something like (extending from the above):
foreach my $file ( @file_list ) {
    my $child = XML::Twig -> new ();
    $child -> parsefile ( $file );
    my $child_doc = $child -> root -> cut;
    $child_doc -> paste ( $twig -> root );
}
$twig -> print;
Exactly what you'd need to do depends a little on your desired output structure: you'd need to 'wrap' the documents in a root element anyway. Open a new question with some sample input and desired output, and I'll happily take a crack at it.
As an example - if you feed the above your sample input twice, you get:
<?xml version="1.0" encoding="UTF-8"?>
<document><document> This is a document root
<child>----</child></document> This is a document root
<child>----</child></document>
Which I know isn't likely to be what you want, but hopefully illustrates a parser based way of XML restructuring.

Parsing XML file with perl - regex

I'm just a beginner in Perl, and very urgently need to prepare a small script that takes the top 3 things from an XML file and puts them in a new one.
Here's an example of an xml file:
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file.
Thanks for all the help in advance
regards
peter
Never ever use Regex to handle markup languages.
The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:
XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.
so I made a new version that uses XML::LibXML (thanks, Grant):
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml(location => 'articles.xml');
my $xp = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';
foreach my $article ( $xp->findnodes($xpath) ) {
# now do something with $article
print $article.": ".$article->getName."\n";
}
For me this prints:
XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article
Links to the relevant documentation:
The type of $doc will be XML::LibXML::Document.
The type of $xp is XML::LibXML::XPathContext.
The return type of $xp->findnodes() is XML::LibXML::NodeList.
The type $article is XML::LibXML::Element.
Original version of the answer, based on the XML::XPath package:
use warnings;
use strict;
use XML::XPath;
my $xp = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';
foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
# now do something with $article
print $article.": ".$article->getName ."\n";
}
which prints this for me:
XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article
The type of $xp is XML::XPath, obviously.
The return type of $xp->findnodes() is XML::XPath::NodeSet.
The type of $article will be XML::XPath::Node::Element in this case.
Have a look at the docs to find out what you can do with them.
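To get from "do something with $article" to the new file the question asks for, one possibility is to copy the selected nodes into a fresh document. The sketch below makes the same assumption as the XPath above, namely that the articles sit inside an `<articles>` root; the `<title>` content and the output file name are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the question's articles wrapped in an <articles> root.
my $xml = <<'XML';
<articles>
<article><title>Story 1</title></article>
<article><title>Story 2</title></article>
<article><title>Story 3</title></article>
<article><title>Story 4</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Build a new document with its own <articles> root.
my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $root = $out->createElement('articles');
$out->setDocumentElement($root);

# importNode makes a copy of the node that belongs to the new document.
for my $article ( $doc->findnodes('/articles/article[position() < 4]') ) {
    $root->appendChild( $out->importNode($article) );
}

my $result = $out->toString;
print $result;
# $out->toFile('top3.xml') would write to disk instead (illustrative file name).
```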
Here:
open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
while (<$input>) {
    print $output $_;
    if (m:</article>:) {
        $n_articles++;
        if ($n_articles >= 3) {
            last;
        }
    }
}
close $input or die $!;
close $output or die $!;
You really don't need an XML parser to do such a simple job.