Generic solution for removing the XML declaration using Perl regex

Hi, I want to remove the declaration from my XML file. The problem is that the declaration is sometimes on the same line as the root element.
The XML looks as follows:
Case 1:
<?xml version="1.0" encoding="UTF-8"?> <document> This is a document root
<child>----</child>
</document>
Case 2:
<?xml version="1.0" encoding="UTF-8"?>
<document> This is a document root
<child>----</child>
</document>
The function should work both when the root node follows the declaration on the same line and when it is on the next line.
My function works only for case 2.
sub getXMLData {
    my ($xml) = @_;
    my @data = ();
    open(FILE,"<$xml");
    while(<FILE>) {
        chomp;
        if(/\<\?xml\sversion/) {next;}
        push(@data, $_);
    }
    close(FILE);
    return join("\n",@data);
}
*** Please note that the encoding is not always the same.

OK, so the problem here is - you're trying to parse XML line based, and that DOESN'T WORK. You should avoid doing it, because it makes brittle code, which will one day break - as you've noted - thanks to perfectly valid changes to the source XML. Both your documents are semantically identical, so the fact your code handles one and not the other is an example of exactly why doing XML this way is a bad idea.
More importantly though - why are you trying to remove the XML declaration from your XML? What are you trying to accomplish?
Generically reformatting XML can be done like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
pretty_print => 'indented',
);
$twig->parsefile('your_xml_file');
$twig->print;
This will parse your XML and reformat it in one of the valid ways XML may be formatted. However I would strongly urge you not to just discard your XML declaration, and instead carry on with something like XML::Twig to process it. (Open a new question with what you're trying to accomplish, and I'll happily give you a solution that doesn't trip up with different valid formats of XML).
When it comes to merging XML documents, XML::Twig can do this too - and still check and validate your XML as it goes.
So you might do something like (extending from the above):
foreach my $file ( @file_list ) {
    my $child = XML::Twig -> new ();
    # parse the current file from the list
    $child -> parsefile ( $file );
    my $child_doc = $child -> root -> cut;
    $child_doc -> paste ( $twig -> root );
}
$twig -> print;
Exactly what you'd need to do depends a little on your desired output structure - you'd need to wrap everything in a root element anyway. Open a new question with some sample input and desired output, and I'll happily take a crack at it.
As an example - if you feed the above your sample input twice, you get:
<?xml version="1.0" encoding="UTF-8"?>
<document><document> This is a document root
<child>----</child></document> This is a document root
<child>----</child></document>
Which I know isn't likely to be what you want, but hopefully illustrates a parser based way of XML restructuring.
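If the declaration really does have to go, one parser-based option is to serialise only the root element instead of filtering lines. The following is a minimal sketch, assuming XML::LibXML is installed; the helper name xml_without_declaration and the inline sample strings are made up for illustration, and the second sample declares a different encoding to mirror the note that the encoding is not constant.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse the document and return only the root
# element as a string. The XML declaration is not part of the root
# element, so it is left out wherever it sat in the input.
sub xml_without_declaration {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml_string );
    return $doc->documentElement->toString;
}

# Case 1: declaration and root element on the same line.
my $case1 = qq{<?xml version="1.0" encoding="UTF-8"?> <document> This is a document root\n}
          . qq{<child>----</child>\n</document>};

# Case 2: root element on the next line, different encoding.
my $case2 = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<document> This is a document root\n}
          . qq{<child>----</child>\n</document>};

print xml_without_declaration($_), "\n" for $case1, $case2;
```

Note that this also drops anything else that sits outside the root element, such as comments or a DOCTYPE, so it only fits when the root element is all that is wanted.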

Related

Unable to figure out how to replace a tag in xml

Need help to figure out correct regex for replacing xml tag with contents of a file.
Tried basic things like escaping special characters but no luck. Open to using something else other than sed.
config.txt
<localReplications/>
replace-with-config.txt
<localReplications>
<localReplication>
<enabled>true</enabled>
<cronExp>0 0 /5 * * ?</cronExp>
<syncDeletes>true</syncDeletes>
<syncProperties>true</syncProperties>
<repoKey>some-repo-key</repoKey>
<url>https://foo.bar/random</url>
<socketTimeoutMillis>15000</socketTimeoutMillis>
<username>foo</username>
<password>bar</password>
<enableEventReplication>true</enableEventReplication>
<syncStatistics>false</syncStatistics>
</localReplication>
</localReplications>
<localReplications/> tag is part of really complicated xml file. I expect <localReplications/> to be replaced with contents in replace-with-config.txt
use XML::LibXML qw();
my $config = XML::LibXML
->load_xml(location => 'config.txt');
my $replace = (XML::LibXML
->load_xml(location => 'replace-with-config.txt')
->findnodes('//localReplications')
)[0];
for my $local_replications (
$config->findnodes('//localReplications')
) {
# $local_replications->replaceNode($replace);
# this fails with HIERARCHY_REQUEST_ERR,
# so do it in two steps instead
$local_replications->addSibling($replace);
$local_replications->unbindNode;
}
print $config->toString;

Edit all files in directory tree with regular expression on Windows

I am looking for a program that can edit all files in a directory tree, like Perl on Unix systems. The tree contains XML files and further folders.
The regex should delete all the content placed inside the <loot></loot> element.
For example, this file:
<?xml version="1.0" encoding="UTF-8"?>
<monster name="Dragon"/>
<health="10000"/>
<immunities>
<immunity fire="1"/>
</immunities>
<loot>
<item id="1"/>
<item id="3"/>
<inside>
<item id="6"/>
</inside>
</item>
</loot>
should look like this after the edit:
<?xml version="1.0" encoding="UTF-8"?>
<monster name="Dragon"/>
<health="10000"/>
<immunities>
<immunity fire="1"/>
</immunities>
<loot>
</loot>
I would shy away from anything regex based - XML simply doesn't work with regular expressions.
But fortunately, Perl for Windows is readily available. And better yet, if you go with Strawberry perl, it comes bundled with both XML::Twig and XML::LibXML.
At which point the problem becomes inanely simple:
#!/usr/bin/perl
use warnings;
use strict;
use File::Find::Rule;
use XML::Twig;
sub delete_loot {
my ( $twig, $loot ) = @_;
foreach my $loot_entry ( $loot -> children ) {
$loot_entry -> delete;
}
$twig -> flush;
}
my $twig = XML::Twig -> new ( pretty_print => 'indented',
twig_handlers => { 'loot' => \&delete_loot ,
'_all_' => sub { $_ -> flush } } );
foreach my $file ( File::Find::Rule -> file()
-> name ( '*.xml.txt' )
-> in ( 'C:\tmp' ) ) {
print "Processing $file\n";
$twig -> parsefile_inplace($file);
}
Of course, this also assumes that your XML is, in fact, XML - which your example isn't. If that example is actually correct, then you should really hit whoever wrote it around the head with a rolled up copy of the XML Spec whilst chanting 'don't make fake XML'.

XML to JSON conversion using perl

I am using Perl to convert an XML file to JSON. I was able to write the code for converting the input XML file to JSON format.
Here is the sample xml:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Person SchemaVersion="1.0.8">
<ID>0</ID>
<Home ID="ABC-XYZ" State="Unknown">
<Location>
<SiteName></SiteName>
<Number>62</Number>
<MaxSize>0</MaxSize>
<Comment></Comment>
</Location>
<Laptop>
<FileName>/usr/temp/RPM_020515_.tar.gz</FileName>
</Laptop>
</Home>
</Person>
sample perl code doing json conversion:
#!/usr/bin/perl
use JSON;
use XML::Simple;
# Create the object of XML Simple
my $xmlSimple = new XML::Simple(KeepRoot => 1);
# Load the xml file in object
my $dataXML = $xmlSimple->XMLin("input/path/to/above/XMLFILE");
# use encode json function to convert xml object in json.
my $jsonString = encode_json($dataXML);
# finally print json
print $jsonString;
Output JSON value:
{
"Person": {
"ID": "0",
"SchemaVersion": "1.0.8",
"Home": {
"ID": "ABC-XYZ",
"Laptop": {
"FileName": "/usr/temp/RPM_020515_.tar.gz"
},
"Location": {
"Number": "62",
"MaxSize": "0",
"Comment": { },
"SiteName": { }
},
"State": "Unknown"
}
}
}
Above code is working fine.
My question is about the next step, which is where I am actually stuck.
I need to do one more thing along with the JSON conversion: check whether the element "FileName" in the XML is empty or not, and if it is not empty, extract its value as well.
So output will be two things:
1. XML To JSON convert ( working fine )
2. Along with point 1. extract the value of nested element in XML
"FileName". I need that for some business logic need in next step.
Can some perl experts help me here and suggest me how i can do that in my current perl code.
Thanks for helping in advance.
This is my first perl script so please excuse me if this is a too trivial question to ask.
Tried reading the perl docs but not that helpful.
NOTE: I am trying to use only libraries that ship with Perl, not any new third-party libraries, which is why I used XML::Simple. It is a production code restriction (a very bad constraint). I hope something for this exists in XML::Simple or JSON.
As XML::Simple docs note, it 'slurps' the XML into a data structure analogous to the input XML. Thus, to figure out the contents of the XML, just treat it as the hash reference it likely is:
if ($dataXML->{Person}{Home}{ID} eq 'foo') {
# Some action
...
}
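Applied to the FileName element from the question, the same idea can be sketched as below. This is an illustration only: extract_file_name is a made-up helper, and the two hashes are hand-written stand-ins for what XMLin returns with KeepRoot => 1. It relies on the behaviour visible in the JSON output above, where empty elements such as Comment and SiteName come back as empty hash references rather than empty strings.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the nested hash without autovivifying
# missing keys, and return the FileName text, or undef when the
# element is missing or empty.
sub extract_file_name {
    my ($data) = @_;
    my $node = $data;
    for my $key (qw(Person Home Laptop FileName)) {
        return unless ref $node eq 'HASH' && exists $node->{$key};
        $node = $node->{$key};
    }
    return if ref $node;                            # {} means an empty element
    return unless defined $node && length $node;    # empty string
    return $node;
}

# Stand-ins for the structure returned by XMLin (assumed shapes).
my $with_name = { Person => { Home => { Laptop => {
    FileName => '/usr/temp/RPM_020515_.tar.gz' } } } };
my $empty     = { Person => { Home => { Laptop => { FileName => {} } } } };

for my $data ( $with_name, $empty ) {
    my $file_name = extract_file_name($data);
    print defined $file_name
        ? "FileName: $file_name\n"
        : "FileName is empty or missing\n";
}
```

In the script from the question, the call would be extract_file_name($dataXML), made either before or after encode_json, since the helper does not modify the structure.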
You should use XML::LibXML instead of XML::Simple. Then use xpath to query each document. I'd take a look at the code of XML::XML2JSON to see if it can be a good fit for the problem…

XML Name space issue revisited

I am still not able to find a good solution to the problem that findnodes or findvalue does not work when xmlns is set to some value.
The moment I manually set xmlns="", it starts working, at least in my case. Now I need to automate this.
Consider this:
<root xmlns="something">
--
---
</root>
My proposed solution:
dynamically set xmlns="" before querying,
and once the work is done, reset it to the original value xmlns="something".
This seems to work for my XMLs, but it is still manual.
I need to automate this.
There are two options for doing it:
using a Perl regex, or
using proper LibXML calls such as setNamespace.
Please share your thoughts in this context.
You register the namespace. The point of XML is not having to kludge around with regexes!
Besides, it's easier: you create an XML::LibXML::XPathContext, register your namespaces, and use its find* calls with your chosen prefixes.
The following example is verbatim from a script of mine to list references in Visual Studio projects:
(...)
# namespace handling, see the XML::LibXML::Node documentation
my $xpc = new XML::LibXML::XPathContext;
$xpc->registerNs( 'msb',
'http://schemas.microsoft.com/developer/msbuild/2003' );
(...)
my $tree; eval { $tree = $parser->parse_file($projfile) };
(...)
my $root = $tree->getDocumentElement;
(...)
foreach my $attr ( find( '//msb:*/@Include', $root ) )
{
(...)
}
(...)
sub find { $xpc->find(@_)->get_nodelist; }
(...)
That's all it takes!
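Since the excerpt above is heavily elided, here is a self-contained sketch of the same approach, assuming XML::LibXML is installed. The namespace URI urn:example:something, the prefix ns, and the item elements are all invented for the example; any prefix works as long as it is registered against the URI that the document actually declares.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document with a default namespace (URI invented for the example).
my $xml = <<'XML';
<root xmlns="urn:example:something">
  <item>first</item>
  <item>second</item>
</root>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# An unprefixed XPath name only matches elements in no namespace,
# so this finds nothing, which is the symptom from the question.
my @unprefixed = $doc->findnodes('//item');

# Register a prefix for the namespace URI and query through the context.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( 'ns', 'urn:example:something' );
my @items = map { $_->textContent } $xpc->findnodes('//ns:item');

print scalar(@unprefixed), " unprefixed matches\n";
print "Found with prefix: @items\n";
```

The document itself is never modified, so there is no xmlns value to blank out and restore afterwards.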
I only have one xmlns attribute, once, at the top of the XML, so this works for me.
All I did was first remove the namespace part, i.e. remove the xmlns from my XML file.
NODE : for my $node ($conn->findnodes("//*[name()='root']")) {
my $att = $node->getAttribute('xmlns');
$node->setAttribute('xmlns', "");
last NODE;
}
I use last just to make sure I come out of the for loop in time.
Then, once I am done with the XML parsing, I replace
<root>
with
<root xmlns="something">
using a simple Perl file operation or sed.

Format XML file in c++ or Qt

I have an XML file where outputs are not getting formatted. That means all the outputs are in a single line but I want to break it tag by tag.
For e.g. -
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><Analyser> <JointDetails> <Details><StdThickness> T </StdThickness><Thickness_num> 0.032 </Thickness_num></Details> </JointDetails></Analyser>
But i want to do it like this ::
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Analyser>
<JointDetails>
<Details>
<StdThickness> T </StdThickness>
<Thickness_num> 0.032 </Thickness_num>
</Details>
</JointDetails>
</Analyser>
Please don't suggest to do it while writing the XML file because this XML file is already there but now I have to format it as mentioned above.
Using a QXmlStreamReader and QXmlStreamWriter should do what you want. QXmlStreamWriter::setAutoFormatting(true) will format the XML on different lines and use the correct indentation. With QXmlStreamReader::isWhitespace() you can filter out superfluous whitespace between tags.
QString xmlIn = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>"
"<Analyser><JointDetails> <Details><StdThickness>"
" T </StdThickness><Thickness_num> 0.032 </Thickness_num>"
"</Details> </JointDetails></Analyser>";
QString xmlOut;
QXmlStreamReader reader(xmlIn);
QXmlStreamWriter writer(&xmlOut);
writer.setAutoFormatting(true);
while (!reader.atEnd()) {
reader.readNext();
if (!reader.isWhitespace()) {
writer.writeCurrentToken(reader);
}
}
qDebug() << xmlOut;
If you're using Qt, you can read it with QXmlStreamReader and write it with QXmlStreamWriter, or parse it as QDomDocument and convert that back to QString. Both QXmlStreamWriter and QDomDocument support formatting.
void format(void)
{
QDomDocument input;
QFile inFile("D:/input.xml");
QFile outFile("D:/output.xml");
inFile.open(inFile.Text | inFile.ReadOnly);
outFile.open(outFile.Text | outFile.WriteOnly);
input.setContent(&inFile);
QDomDocument output(input);
QTextStream stream(&outFile);
output.save(stream, 2);
}
If you want a simple, robust solution that does not rely on Qt, you can use libxml2. (If you are using Qt anyway, just use what Frank Osterfeld said.)
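#include <libxml/parser.h>
#include <libxml/tree.h>
// XML_PARSE_NOBLANKS drops ignorable whitespace so the output can be re-indented.
xmlDoc* xdoc = xmlReadFile("myfile.xml", NULL, XML_PARSE_NOBLANKS);
xmlSaveFormatFile("myfilef.xml", xdoc, 1);
xmlFreeDoc(xdoc);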
Can I interest you in my C++ wrapper of libxml2?
Edit:
If you happen to have the XML string in memory, you may also use xmlReadDoc... But it doesn't stop there.
Using C++, you can add a single character between each instance of >< for output:
by changing >< to >\n< (this inserts a newline character), each tag will print on a new line. As mentioned above, there are APIs that do this, but for a simple way to get console output, or to make the XML flow onto one line per tag in something like a text editor, the \n should work fine.
If you need more elegant output, you can write a method yourself that uses \n (newline) and \t (tab) to lay out the output, or use an API if you require a more elaborate representation.