Parsing XML file with perl - regex

I'm just a beginner in Perl and very urgently need to prepare a small script that takes the top 3 items from an XML file and puts them in a new one.
Here's an example of an xml file:
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
<article>
{lot of other stuff here}
</article>
What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file.
Thanks for all the help in advance
regards
peter

Never ever use Regex to handle markup languages.
The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:
XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.
so I made a new version that uses XML::LibXML (thanks, Grant):
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml(location => 'articles.xml');
my $xp = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';
foreach my $article ( $xp->findnodes($xpath) ) {
# now do something with $article
print $article.": ".$article->getName."\n";
}
For me this prints:
XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article
Links to the relevant documentation:
The type of $doc will be XML::LibXML::Document.
The type of $xp is XML::LibXML::XPathContext.
The return type of $xp->findnodes() is XML::LibXML::NodeList.
The type of $article is XML::LibXML::Element.
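The loop above only prints the matched nodes, while the question asks for them to end up in a new file. Here is a minimal sketch of one way to do that with XML::LibXML. It assumes the articles sit under an <articles> root (as the XPath above does); the inline sample data and the output name top3.xml are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input; a real script would use
# XML::LibXML->load_xml(location => 'articles.xml') instead.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Build a new, empty document with the same root element name.
my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $out->createElement('articles');
$out->setDocumentElement($root);

# importNode makes a deep copy that belongs to the new document,
# so the source document is left untouched.
for my $article ($doc->findnodes('/articles/article[position() < 4]')) {
    $root->appendChild($out->importNode($article));
}

# Serialize; to write a file instead, call $out->toFile('top3.xml', 1).
my $result = $out->toString(1);
print $result;
```

Copying with importNode rather than moving the nodes keeps the original document intact, which matters if you still need it afterwards.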
Original version of the answer, based on the XML::XPath package:
use warnings;
use strict;
use XML::XPath;
my $xp = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';
foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
# now do something with $article
print $article.": ".$article->getName ."\n";
}
which prints this for me:
XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article
The type of $xp is XML::XPath, obviously.
The return type of $xp->findnodes() is XML::XPath::NodeSet.
The type of $article will be XML::XPath::Node::Element in this case.
Have a look at the docs to find out what you can do with them.

Here:
open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
while (<$input>) {
print $output $_;
if (m:</article>:) {
$n_articles++;
if ($n_articles >= 3) {
last;
}
}
}
close $input or die $!;
close $output or die $!;
You really don't need an XML parser to do such a simple job.

How to parse a json file without using json library and json parser

Below are the input and Output Details
Input: this is just a sample; the real input is huge and is all on a single line.
[{"mnemonic":"PT.IA1","ID":"000628"}, {"mnemonic":"EOR.1","ID":"000703"}]
Code: I'm trying to read the file using }, as the delimiter so that I get each value, but because it's a single-line file, everything is printed at once. How do I parse this line with a delimiter? Is the split function enough to do this job?
our $conf =
{
chunk_separator => '\{,',
}
open( FH, "Etot_Data.txt" ) or die $!;
while ( my $chunk = <FH> ){
my $sections = [ split $conf->{chunk_separator}, $chunk ]
print "$chunk\n";
}
Output
I want to pick the "ID" from each value and prepend "abc." to it.
The final strings would look like abc.000628 or abc.000703, and I want to save them in a hash.
I don't need any values other than the ID in the JSON string.
Is it possible to read the JSON file as a normal file and operate on it?
I don't have a JSON parser and I don't have the option to use one.
Thanks for the help
If you can't install any external modules, you can certainly include one alongside your script...
Create a JSON directory in the same directory your script is in, then copy the contents of the JSON::PP module, and put it into a PP.pm file inside of the JSON directory you just created. Then, in your script, add the current working directory as a library directory: use lib '.';, and use JSON::PP;.
use warnings;
use strict;
use lib '.';
use JSON::PP qw(decode_json);
my $json;
{
local $/;
open my $fh, '<', 'file.json' or die $!;
$json = <$fh>;
}
my $perl = decode_json $json;
for (@$perl){
my $id = 'abc.' . $_->{ID};
print "$id\n";
}
Output:
abc.000628
abc.000703
If you need to hide the fact you've created an additional module, with only slight finagling, you can make some changes to the module, and include it directly within the script itself.
Note that JSON::PP is in Perl core in v5.14+. OP stated in comments that they are on 5.10.
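The question also asks for the prefixed IDs to be saved in a hash. A rough sketch of that step, reusing the sample line from the question; keying the hash by mnemonic is my assumption, since the question doesn't say what the keys should be.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Sample input copied from the question.
my $json = '[{"mnemonic":"PT.IA1","ID":"000628"}, {"mnemonic":"EOR.1","ID":"000703"}]';

my $records = decode_json($json);

# Assumption: key the hash by mnemonic. The parentheses make sure
# Perl parses the braces as a block, not an anonymous hash.
my %id_for = map { ($_->{mnemonic} => "abc.$_->{ID}") } @$records;

print "$_ => $id_for{$_}\n" for sort keys %id_for;
```

Because the IDs are JSON strings, their leading zeros survive decoding.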
Everything that people have said in the comments is true. JSON is a complex data format and expecting to parse it without using the tools that already exist is foolish in the extreme. I urge you to fix whatever roadblock is preventing you from installing and using a JSON parser.
However...
If you only want the ID, and the format is always identical to your sample data, then there is a way to do it. This code is pretty fragile and I can't emphasise enough how much of a bad idea this is. But this code seems to work on your sample data.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
while (<>) {
foreach my $id (/"ID":"(\d+)"/g) {
say "abc.$id";
}
}
It reads from STDIN and writes to STDOUT, so call it like this:
$ ./parse_json.pl < Etot_Data.txt
But please get a JSON parser installed.

Regex to split path value C:\Users\goudarsh\Desktop\Perl_test_scripts\rtl2gds

I need a Perl regex to split the following value:
$path = 'C:\Users\goudarsh\Desktop\Perl_test_scripts\sample';
I tried the following code, but it doesn't seem to work:
my @var = split(/\\/,$path);
print @var;
if(grep /rtl2gds/, @var){
print $path;
}
I can't see what I'm doing wrong.
I even tried the following:
my @var = split(//\/,$path);
print @var;
if(grep /rtl2gds/, @var){
print $path;
}
Instead of relying on manual splitting, I recommend using File::Spec:
use File::Spec;
my ($volume, $dir, $file) = File::Spec->splitpath($path);
my @components = File::Spec->splitdir($dir);
push @components, $file;
Now @components is your desired array, with a safer and more portable implementation.
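File::Spec follows the rules of the OS the script runs on, so on a Unix box it won't treat backslashes as separators. A small sketch that calls File::Spec::Win32 directly, so the Windows path splits the same way everywhere. The path ending in rtl2gds is taken from the question title, not from the question body, so treat it as an assumed input.

```perl
use strict;
use warnings;
use File::Spec::Win32;

# Assumed input: the path from the question title.
my $path = 'C:\Users\goudarsh\Desktop\Perl_test_scripts\rtl2gds';

my ($volume, $dir, $file) = File::Spec::Win32->splitpath($path);

# splitdir can return empty strings for the leading and trailing
# separators, so filter those out before using the list.
my @components = grep { length } File::Spec::Win32->splitdir($dir);
push @components, $file;

print join('|', $volume, @components), "\n";

# Exact comparison avoids accidental substring matches.
print "found rtl2gds\n" if grep { $_ eq 'rtl2gds' } @components;
```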
Your example works fine... have you actually run it?
use strict;
use warnings;
use Data::Dumper;
my $path = 'C:\Users\goudarsh\Desktop\Perl_test_scripts\sample';
my @var = split(/\\/, $path);
print Dumper(\@var);
Output:
$VAR1 = [
'C:',
'Users',
'goudarsh',
'Desktop',
'Perl_test_scripts',
'sample'
];
Because the Path::Class docs are a bit convoluted, here is an example:
use strict;
use warnings;
use 5.020;
use Path::Class; # Exports file() by default
my $path = file('/Users/7stud/perl_programs/myprog.pl');
say $path->basename; # => myprog.pl
say $path->dir; # => /Users/7stud/perl_programs
say $path->volume; # => ""
my @components = $path->components;
for my $component (@components) {
say "-->$component<--";
}
--output:--
--><--
-->Users<--
-->7stud<--
-->perl_programs<--
-->myprog.pl<--
On Windows, a module that understands Windows-style paths will load automatically. To examine a Windows-style path on a Unix system:
use strict;
use warnings;
use 5.020;
use Path::Class qw{ foreign_file };
my $path = foreign_file('Win32', 'C:\Users\goudarsh\Desktop\Perl_test_scripts\sample');
say $path->basename; # => sample
say $path->dir; # => C:\Users\goudarsh\Desktop\Perl_test_scripts
say $path->volume; # => C:
my @components = $path->components;
for my $component (@components) {
say "-->$component<--";
}
--output:--
--><--
-->Users<--
-->goudarsh<--
-->Desktop<--
-->Perl_test_scripts<--
-->sample<--

Perl Match Substring in a string ignore whitespace

I have a string
$str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";
where xxxxx represents any characters. I am only interested in extracting the value of each attribute,
i.e. I want to save
$fileName = Some_thing.c;
$fileVersion = 27;
$designDocName = some_other_design.doc;
$designDocVersion = 43;
$compilerVersion = 9.5;
Right now I feel like I have a messy solution, and I'm wondering if there is a cleaner way to do this. It might also fail if I have multiple files with multiple versions.
First I remove all whitespace, and then I keep splitting the string in two until I have all the values:
$str =~ s/\s*//g;
($temp,$temp2) = split(/CodeFile\(s\)Name:/,$str,2);
($fileName,$temp) = split(/CodeFile\(s\)Version:/,$temp2,2);
($fileVersion,$temp2) = split(/DesignDocument:/,$temp,2);
($designDocName,$temp) = split(/DesignDocumentVersion:/,$temp2,2);
($designDocVersion,$temp2) = split(/some/,$temp,2);
($testedCompilerVersion,$temp) = split(/CompilerVersion:/,$temp2,2);
($testedCompilerVersion,$temp2) = split(/DocType:/,$temp,2);
Please lead me to a link or an efficient solution.
Thanks in advance.
PS: Please also check the comment below the question.
Perhaps the following will be helpful:
use strict;
use warnings;
use Data::Dumper;
my $str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";
my @labels = qw/fileName fileVersion designDocName designDocVersion compilerVersion docType/;
my ($i, %items) = 0;
$items{$labels[$i++]} = $1 while $str =~ /.+?:(\S+)\s+?/g;
print Dumper \%items;
Output:
$VAR1 = {
'designDocName' => 'some_other_design.doc',
'fileName' => 'Some_thing.c',
'docType' => 'Word',
'designDocVersion' => '43',
'fileVersion' => '27',
'compilerVersion' => '9.5'
};
Although I would go with @Kenosis's solution, I still wanted to show you how your script could be simplified.
#!/usr/bin/perl
use v5.14;
use warnings;
my $str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";
my ($fileName,
$fileVersion,
$designDocName,
$designDocVersion,
$compilerVersion) = $str =~ /:(\S+)/g;
say "$fileName, $fileVersion, $designDocName, $designDocVersion, $compilerVersion";
#Some_thing.c, 27, some_other_design.doc, 43, 9.5
# Alternatively, match against the labels themselves (this reassigns the same variables):
($fileName, $fileVersion, $designDocName, $designDocVersion, $compilerVersion) =
$str =~ /Code File\(s\) Name:(.*) CodeFile\(s\) Version:(.*) Design Document:(.*) Module Design Document Version:(.*) somexxxxxxxxxx Compiler Version:(.*) Doc Type:(.*) xxxxxx/;
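Since the labels in the sample are spaced inconsistently ("Code File(s)" vs "CodeFile(s)"), another option is to look each value up by its label while allowing any amount of whitespace between the label's words. This is only a sketch: the %label_for table and the loop are my own illustration, not taken from either answer above.

```perl
use strict;
use warnings;

my $str = "xxxxxx Code File(s) Name:Some_thing.c CodeFile(s) Version:27 Design Document:some_other_design.doc Module Design Document Version:43 somexxxxxxxxxx Compiler Version:9.5 Doc Type:Word xxxxxx";

# Hypothetical lookup table: variable name => label as written in the text.
my %label_for = (
    fileName         => 'Code File(s) Name',
    fileVersion      => 'Code File(s) Version',
    designDocName    => 'Design Document',
    designDocVersion => 'Design Document Version',
    compilerVersion  => 'Compiler Version',
);

my %value;
for my $key (keys %label_for) {
    # Escape each word of the label and allow optional whitespace between words.
    my $pattern = join '\s*', map { quotemeta($_) } split ' ', $label_for{$key};
    # The value is the run of non-whitespace after the colon; undef if the label is absent.
    ($value{$key}) = $str =~ /$pattern\s*:\s*(\S+)/;
}

print "$_ = $value{$_}\n" for sort keys %value;
```

Unlike the positional approaches, this does not depend on the attributes appearing in a fixed order, though it still only picks up the first occurrence of each label.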

Querying a website with Perl LWP::Simple to Process Online Prices

In my free time, I've been trying to improve my perl abilities by working on a script that uses LWP::Simple to poll one specific website's product pages to check the prices of products (I'm somewhat of a perl noob). This script also keeps a very simple backlog of the last price seen for that item (since the prices change frequently).
I was wondering if there was any way I could further automate the script so that I don't have to explicitly add the page's URL to the initial hash (i.e. keep an array of key terms and run a search query on Amazon to find the page or price). Is there any way I could do this that doesn't involve just copying Amazon's search URL and plugging in my keywords? (I'm aware that processing HTML with regex is generally bad form; I only used it because I need one small piece of data.)
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my %oldPrice;
my %nameURL = (
"Archer Season 1" => "http://www.amazon.com/Archer-Season-H-Jon-Benjamin/dp/B00475B0G2/ref=sr_1_1?ie=UTF8&qid=1297282236&sr=8-1",
"Code Complete" => "http://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670/ref=sr_1_1?ie=UTF8&qid=1296841986&sr=8-1",
"Intermediate Perl" => "http://www.amazon.com/Intermediate-Perl-Randal-L-Schwartz/dp/0596102062/ref=sr_1_1?s=books&ie=UTF8&qid=1297283720&sr=1-1",
"Inglorious Basterds (2-Disc)" => "http://www.amazon.com/Inglourious-Basterds-Two-Disc-Special-Brad/dp/B002T9H2LK/ref=sr_1_3?ie=UTF8&qid=1297283816&sr=8-3"
);
if (-e "backlog.txt"){
open (LOG, "backlog.txt");
while(<LOG>){
chomp;
my @temp = split(/:\s/);
$oldPrice{$temp[0]} = $temp[1];
}
close(LOG);
}
print "\nChecking Daily Amazon Prices:\n";
open(LOG, ">backlog.txt");
foreach my $key (sort keys %nameURL){
my $content = get $nameURL{$key} or die;
$content =~ m{\s*\$(\d+.\d+)} || die;
if (exists $oldPrice{$key} && $oldPrice{$key} != $1){
print "$key: \$$1 (Was $oldPrice{$key})\n";
}
else{
print "\n$key: $1\n";
}
print LOG "$key: $1\n";
}
close(LOG);
Yes, the design can be improved. It's probably best to delete everything and start over with an existing full-featured web scraping application or framework, but since you want to learn:
The name-to-URL map is configuration data. Retrieve it from outside of the program.
Store the historic data in a database.
Learn XPath and use it to extract data from HTML, it's easy if you already grok CSS selectors.
Other stackers, if you want to amend my post with the rationale for each piece of advice, go ahead and edit it.
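To illustrate the first point, here is a minimal sketch of loading the name-to-URL map from outside the program. The tab-separated format, the comment convention and the file name products.tsv are all assumptions of mine; the data is read from an in-memory string so the example runs on its own, and the URLs are shortened versions of the ones in the question.

```perl
use strict;
use warnings;

# Stand-in for the contents of a hypothetical products.tsv file:
# one "name<TAB>url" pair per line, with # starting a comment line.
my $config = "Code Complete\thttp://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670\n"
           . "# lines starting with # are ignored\n"
           . "Intermediate Perl\thttp://www.amazon.com/Intermediate-Perl-Randal-L-Schwartz/dp/0596102062\n";

# Opening a reference to a scalar reads from the string;
# a real script would open 'products.tsv' here instead.
open my $fh, '<', \$config or die "Cannot open config: $!";

my %nameURL;
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*(?:#|$)/;          # skip comments and blank lines
    my ($name, $url) = split /\t/, $line, 2; # split on the first tab only
    $nameURL{$name} = $url;
}
close $fh;

print "$_ => $nameURL{$_}\n" for sort keys %nameURL;
```

With the map in a file, adding or removing a product no longer means editing the script.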
I made a simple script to demonstrate Amazon search automation. It takes the search URL for all departments and appends the URI-escaped search term. The rest of the code is simple parsing with HTML::TreeBuilder. The structure of the HTML in question can easily be examined with the dump method (see the commented-out line).
use strict; use warnings;
use LWP::Simple;
use URI::Escape;
use HTML::TreeBuilder;
use Try::Tiny;
my $look_for = "Archer Season 1";
my $contents
= get "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords="
. uri_escape($look_for);
my $html = HTML::TreeBuilder->new_from_content($contents);
for my $item ($html->look_down(id => qr/result_\d+/)) {
# $item->dump; # find out structure of HTML
my $title = try { $item->look_down(class => 'productTitle')->as_trimmed_text };
my $price = try { $item->look_down(class => 'newPrice')->find('span')->as_text };
print "$title\n$price\n\n";
}
$html->delete;

Regexp to find youtube url, strip off parameters and return clean video url?

imagine this url:
http://www.youtube.com/watch?v=6n8PGnc_cV4&feature=rec-LGOUT-real_rn-2r-13-HM
What is the cleanest and best regexp to do the following?
1.) I want to strip off everything after the video URL, so that only http://www.youtube.com/watch?v=6n8PGnc_cV4 remains.
2.) I want to convert this URL into http://www.youtube.com/v/6n8PGnc_cV4
Since I'm not much of a regexp-ert, I need your help:
$content = preg_replace('http://.*?\?v=[^&]*', '', $content);
return $content;
edit: check this out! I want to create a really simple WordPress plugin that just recognizes every normal youtube URL in my $content and replaces it with the embed code:
<?php
function videoplayer($content) {
$embedcode = '<object class="video" width="308" height="100"><embed src="' . . '" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="308" height="100" wmode="opaque"></embed></object>';
//filter normal youtube url like http://www.youtube.com/watch?v=6n8PGnc_cV4&feature=rec-LGOUT-real_rn-2r-13-HM
//convert it to http://www.youtube.com/v/6n8PGnc_cV4
//use embedcode and pass along the new youtube url
$content = preg_replace('', '', $content);
//return embedcode
return $content;
}
add_filter('the_content', 'videoplayer');
?>
I use this search criteria in my script:
/((http|ftp)\:\/\/)?([w]{3}\.)?(youtube\.)([a-z]{2,4})(\/watch\?v=)([a-zA-Z0-9_-]+)(\&feature=)?([a-zA-Z0-9_-]+)?/
You could just split it on the first ampersand.
$content = explode('&', $content);
$content = $content[0];
Edit: Simplest regexp: /http:\/\/www\.youtube\.com\/watch\?v=.*/
Youtube links are all the same. To get the video id from them, first you slice off the extra parameters from the end and then slice off everything but the last 11 characters. See it in action:
$url = "http://www.youtube.com/watch?v=1rnfE4eo1bY&feature=...";
$url = substr($url, 0, 42); // "http://www.youtube.com/watch?v=1rnfE4eo1bY"
$url = substr($url, -11); // "1rnfE4eo1bY"
$result = "http://www.youtube.com/v/" . $url; // "http://www.youtube.com/v/1rnfE4eo1bY"
You can uniformize all your youtube links (by removing useless parameters) with a Greasemonkey script: http://userscripts.org/scripts/show/86758. Greasemonkey scripts are natively supported as addons in Google Chrome.
And as a bonus, here is a one (okay, actually two) liner:
$url = "http://www.youtube.com/watch?v=1rnfE4eo1bY&feature=...";
$result = "http://www.youtube.com/v/" . substr(substr($url, 0, 42), -11);
--3ICE
$url = "http://www.youtube.com/watch?v=6n8PGnc_cV4";
$start = strpos($url,"v=");
echo 'http://www.youtube.com/v/'.substr($url,$start+2);