Perl - Parsing Arguments/Options with REGEX

I'm creating a Perl script to convert a list of commands in a template file () and write them in a different format to an output file ().
The commands in the template file will look as follows:
command1 --max-size=2M --type="some value"
I'm having some problems extracting the options and values from this string. So far I have:
m/(\s--\w*=)/ig
Which will return:
" --max-size="
" --type="
However, I have no idea how to return both the option and the value as separate variables, or how to accommodate the use of quotes.
Could anyone steer me in the right direction?
Side note: I'm aware that Getopt does an awesome job of this on the command line, but unfortunately these commands are passed as strings :(

Have you looked at Getopt::Std or Getopt::Long?
There seems to be no reason to reinvent the wheel.
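If the commands arrive as strings rather than in @ARGV, Getopt::Long can still be used: it exports GetOptionsFromString on request, which splits the string shell-style before parsing it. A minimal sketch, assuming the option names max-size and type from the question's example; parse_command_line is a hypothetical helper name:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromString);

# Hypothetical helper: split a template line into its command name and options.
sub parse_command_line {
    my ($line) = @_;
    # The first word is the command; everything after it holds the options.
    my ($command, $rest) = split ' ', $line, 2;
    my %opts;
    # In list context the call also returns any arguments it did not consume.
    my ($ok, $leftover) = GetOptionsFromString(
        $rest // '', \%opts, 'max-size=s', 'type=s',
    );
    die "Could not parse options in: $line\n" unless $ok;
    return ($command, \%opts, $leftover);
}

my ($command, $opts, $leftover) =
    parse_command_line('command1 --max-size=2M --type="some value"');
print "$command: $_ = $opts->{$_}\n" for sort keys %$opts;
```

Storing the values in a hash keeps the option names (including the hyphen in max-size) available as plain keys.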

The code below produces
@args = ('command1', '--max-size=2M', '--type=some value');
That is suitable to pass to GetOptions as follows:
local @ARGV = @args;
GetOptions(...) or die;
Finally, the code:
for ($cmd) {
my @args;
while (1) {
last if /\G \s* \z /xgc;
/\G \s* /xgc;
my $arg;
while (1) {
if (/\G ([^\\"'\s]) /xgc) {
$arg .= $1;
}
elsif (/\G \\ /xgc) {
/\G (.) /sxgc
or die "Incomplete escape";
$arg .= $1;
}
elsif (/\G (?=") /xgc) {
/\G " ( (?:[^"\\]|\\.)* ) " /sxgc
or die "Incomplete double-quoted string";
my $quoted = $1;
$quoted =~ s/\\(.)/$1/sg;
$arg .= $quoted;
}
elsif (/\G (?=') /xgc) {
/\G ' ( [^']* ) ' /xgc
or die "Incomplete single-quoted string";
$arg .= $1;
}
else {
last;
}
}
push @args, $arg;
}
@args
or die "Blank command";
...
}
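For comparison, the core module Text::ParseWords does a similar shell-style split, after which each word can be separated into an option name and a value. A rough sketch, not a drop-in replacement for the tokenizer above; option_pairs is a hypothetical name, and options are assumed to take the form --name or --name=value:

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Hypothetical helper: returns the command name followed by [name, value] pairs.
sub option_pairs {
    my ($line) = @_;
    # shellwords removes the quotes and honours backslash escapes.
    my ($command, @words) = shellwords($line);
    my @pairs;
    for my $word (@words) {
        # Accept --name or --name=value; skip anything else.
        next unless $word =~ /\A--([\w-]+)(?:=(.*))?\z/s;
        push @pairs, [ $1, $2 // '' ];
    }
    return ($command, @pairs);
}

my ($command, @pairs) = option_pairs('command1 --max-size=2M --type="some value"');
print "$command\n";
print "  $_->[0] => $_->[1]\n" for @pairs;
```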

use Data::Dumper;
$_ = 'command1 --max-size=2M a=ignore =ignore --switch --type="some value" --x= --z=1';
my $args = {};
while (/((?<=\s--)[a-z\d-]+)(?:="?|(?=\s))((?<![="])|(?<=")[^"]*(?=")|(?<==)(?!")\S*(?!"))"?(?=\s|$)/ig) {
$args->{$1} = $2;
}
print Dumper($args);
---
$VAR1 = {
'switch' => '',
'x' => '',
'type' => 'some value',
'z' => '1',
'max-size' => '2M'
};

Related

Perl parsing JavaScript file regex, to catch quotes only at the beginning and end of the returned string

I'm just starting to learn Perl. I need to parse JavaScript file. I came up with the following subroutine, to do it:
sub __settings {
my ($_s) = @_;
my $f = $config_directory . "/authentic-theme/settings.js";
if ( -r $f ) {
for (
split(
'\n',
$s = do {
local $/ = undef;
open my $fh, "<", $f;
<$fh>;
}
)
)
{
if ( index( $_, '//' ) == -1
&& ( my @m = $_ =~ /(?:$_s\s*=\s*(.*))/g ) )
{
my $m = join( '\n', @m );
$m =~ s/[\'\;]//g;
return $m;
}
}
}
}
I have the following regex, that removes ' and ; from the string:
s/[\'\;]//g;
It works all right, but if those characters (' and ;) also appear inside the string, they are removed too. This is undesirable, and this is where I'm stuck: I'm not sure how to change the regex above so that it only does the following:
Remove only the first ' in the string
Remove only the last ' in the string
Remove only the last ; in the string, if it exists
Any help, please?

You can use the following to match:
^'|';?$|;$
And replace with '' (empty string)
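In Perl, that pattern goes straight into a substitution with an empty replacement. A small sketch; strip_wrapping is a hypothetical name, and the input is assumed to be a single value without a trailing newline:

```perl
use strict;
use warnings;

sub strip_wrapping {
    my ($value) = @_;
    # Drop a leading quote, a trailing quote (with an optional semicolon),
    # or a lone trailing semicolon; inner quotes and semicolons survive.
    $value =~ s/^'|';?$|;$//g;
    return $value;
}

print strip_wrapping(q{'it's; here';}), "\n";   # prints: it's; here
```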

Remove only first ' in string
Remove only last ' in string
^[^']*\K'|'(?=[^']*$)
Try this. See the demo:
https://regex101.com/r/oF9hR9/8
Remove only the last ; in the string, if it exists
;(?=[^;]*$)
Try this. See the demo:
https://regex101.com/r/oF9hR9/9
https://regex101.com/r/oF9hR9/9
All three in one
^[^']*\K'|'(?=[^']*$)|;(?=[^;]*$)

You can use this code:
#!/usr/bin/perl
$str = "'string; 'inside' another;";
$str =~ s/^'|'?;?$//g;
print $str;
The main idea is to use anchors: ^ matches the beginning of the string, $ matches the end of the string, and ;? matches the ";" symbol at the end only if it is present (the ? quantifier makes the pattern preceding it optional). EDIT: Also, ; will get removed even if there is no preceding '.

I suggest that your original code should look more like this. It is much more idiomatic Perl and I think more straightforward to follow
sub __settings {
my ($_s) = @_;
my $file = "$config_directory/authentic-theme/settings.js";
return unless -r $file;
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
my @file = <$fh>;
chomp @file;
for ( @file ) {
next if m{//};
if ( my @matches = $_ =~ /(?:$_s\s*=\s*(.*))/g ) {
my $matches = join "\n", @matches;
$matches =~ tr/';//d;
return $matches;
}
}
}

perl regex not start and end wi

I am trying to write a Perl script that gets all strings that do not start and end with a single quote. A string must not be part of a comment (#), and a string in DATA does not necessarily start at the beginning of a line.
use warnings;
use strict;
my $file;
{
local $/ = undef;
$file = <DATA>;
};
my @strings = $file =~ /(?:[^']).*(?:[^'])/g;
print join ("\n",@strings);
__DATA__
my $string = 'This is string1';
"This is string2"
# comment : "This is string3"
print "This is comment syntax #"."This is string4";
I am getting nowhere with this regex.
The expected output is
"This is string2"
"This is comment syntax #"
"This is string4"

Obviously this is only an exercise, as there have been many students asking about this problem lately. Regexes will only ever get you part of the way there, as there will pretty much always be edge cases.
The following code is probably good enough for your purposes, but it doesn't even successfully parse itself because of quotes inside a qr{}. You'll have to figure out how to get strings that span lines to work on your own:
use strict;
use warnings;
my $doublequote_re = qr{"(?: (?> [^\\"]+ ) | \\. )*"}x;
my $singlequote_re = qr{'(?: (?> [^\\']+ ) | \\. )*'}x;
my $data = do { local $/; <DATA> };
while ($data =~ m{(#.*|$singlequote_re|$doublequote_re)}g) {
my $match = $1;
if ($match =~ /^#/) {
print "Comment - $match\n";
} elsif ($match =~ /^"/) {
print "Double quote - $match\n";
} elsif ($match =~ /^'/) {
print "Single quote - $match\n";
} else {
die "Carp! something went wrong! <$match>";
}
}
__DATA__
my $string = 'This is string1';
"This is string2"
# comment : "This is string3"
print "This is comment syntax #"."This is string4";

I don't know how to achieve that with a regular expression, so here is a simple hand-written lexer:
#!/usr/bin/perl
use strict;
use warnings;
sub extract_string {
my @buf = split //, shift;
while (my $peer = shift @buf) {
if ($peer eq '"') {
my $str = "$peer";
while ($peer = shift @buf) {
$str .= "$peer";
last if $peer eq '"';
}
if ($peer) {
return ($str, join '', @buf);
}
else {
return ("", "");
}
}
elsif ($peer eq '#') {
return ("", "");
}
}
}
my ($str, $buf);
while ($buf = <DATA>) {
chomp $buf;
while (1) {
($str, $buf) = extract_string $buf;
print "$str\n" if $str;
last unless $buf;
}
}
__DATA__
my $string = 'This is string1';
"This is string2"
# comment : "This is string3"
print "This is comment syntax #"."This is string4";
Another option is to use a Perl module such as PPI.

Find text enclosed by # and replace the inside

The problem:
Find pieces of text in a file enclosed by # and replace the inside
Input:
#abc# abc #ABC#
cba #cba CBA#
Desired output:
абц abc АБЦ
cba цба ЦБА
I have the following:
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
my $output;
open FILE,"<", 'test.txt';
while (<FILE>) {
chomp(my @chars = split(//, $_));
for (@chars) {
my @char;
$_ =~ s/a/chr(0x430)/eg;
$_ =~ s/b/chr(0x431)/eg;
$_ =~ s/c/chr(0x446)/eg;
$_ =~ s/d/chr(0x434)/eg;
$_ =~ s/e/chr(0x435)/eg;
$_ =~ s/A/chr(0x410)/eg;
$_ =~ s/B/chr(0x411)/eg;
$_ =~ s/C/chr(0x426)/eg;
push @char, $_;
$output = join "", @char;
print encode("utf-8",$output);}
print "\n";
}
close FILE;
But I'm stuck on how to process further
Thanks for help in advance!
Kluther

Here is my solution. (It is a prototype; you will need to adapt it.)
for (my $data = <DATA>){
$data=~s/[#]([\s\w]+)[#]/func($1)/ge;
print $data;
# while($data=~m/[#]([\s\w]+)[#]/g){
# print "marked: ",$1,"\n";
# print "position:", pos();
# }
# print "not marked: ";
}
sub func{
#do your magic here ;)
return "<< @_ >>";
}
__DATA__
#abc# abc #ABC# cba #cba CBA#
What happens here?
First, I read the data. You can do that part yourself.
for (my $data = <DATA>){...}
Next, I need to search for your pattern and replace it.
How?
Use the substitution operator: s/pattern/replace/
But in a more interesting form:
s/pattern/func($1)/ge
The g flag means global search (replace every match).
The e flag means evaluate the replacement as Perl code.
So you need to write your own func function ;)
It may be better to use the transliteration operator: tr/listOfSymbolsToBeReplaced/listOfSymbolsThatBePlacedInstead/
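The two ideas combine naturally: the s///e callback can apply tr to just the captured text. A minimal sketch, assuming the letter mapping from the question's code; convert_marked and to_cyrillic are hypothetical names:

```perl
use strict;
use warnings;

# Transliterate the Latin letters listed in the question to Cyrillic.
sub to_cyrillic {
    my ($text) = @_;
    $text =~ tr/abcdeABC/\x{430}\x{431}\x{446}\x{434}\x{435}\x{410}\x{411}\x{426}/;
    return $text;
}

# Replace each #...# span with its transliterated contents, dropping the markers.
sub convert_marked {
    my ($line) = @_;
    $line =~ s/#([^#]*)#/to_cyrillic($1)/ge;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print convert_marked('#abc# abc #ABC#'), "\n";
print convert_marked('cba #cba CBA#'), "\n";
```

Text outside the markers never reaches to_cyrillic, so it is left untouched.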

With minimal changes to your algorithm, you need to keep track of whether you are inside the # marks or not, so add something like this:
my $bConvert = 0;
chomp(my @chars = split(//, $_));
for (@chars) {
my $char = $_;
if (/#/) {
$bConvert = ($bConvert + 1) % 2;
next;
}
elsif ($bConvert) {
$char =~ s/a/chr(0x430)/eg;
$char =~ s/b/chr(0x431)/eg;
$char =~ s/c/chr(0x446)/eg;
$char =~ s/d/chr(0x434)/eg;
$char =~ s/e/chr(0x435)/eg;
$char =~ s/A/chr(0x410)/eg;
$char =~ s/B/chr(0x411)/eg;
$char =~ s/C/chr(0x426)/eg;
}
print encode("utf-8",$char);
}

Try this after $output is processed.
$output =~ s/\#//g;
my @split_output = split(//, $output);
$output = "";
my $len = scalar(@split_output) ;
while ($len--) {
$output .= shift(@split_output);
}
print $output;

It can be done with a single regex and no splitting of the string:
use strict;
use warnings;
use Encode;
my %chars = (
a => chr(0x430),
b => chr(0x431),
c => chr(0x446),
d => chr(0x434),
e => chr(0x435),
A => chr(0x410),
B => chr(0x411),
C => chr(0x426),
);
my $regex = '(' . join ('|', keys %chars) . ')';
while (<DATA>) {
1 while ($_ =~ s|\#(?!\s)[^#]*?\K$regex(?=[^#]*(?!\s)\#)|$chars{$1}|eg);
print encode("utf-8",$_);
}
It does require repeated runs of the regex due to the overlapping nature of the matches.

How to extract the text between two patterns using REGEX perl

In the following lines how can I store the lines between "Description:" and "Tag:" in a variable using REGEX PERL and what would be a good datatype to use, string or list or something else?
(I am trying to write a program in Perl to extract the information of a text file with Debian package information and convert it into a RDF(OWL) file(ontology).)
Description: library for decoding ATSC A/52 streams (development)
liba52 is a free library for decoding ATSC A/52 streams. The A/52 standard is
used in a variety of applications, including digital television and DVD. It is
also known as AC-3.
This package contains the development files.
Homepage: http://liba52.sourceforge.net/
Tag: devel::library, role::devel-lib
The code I have written so far is:
#!/usr/bin/perl
open(DEB,"Packages");
open(ONT,">>debianmodelling.txt");
$i=0;
while(my $line = <DEB>)
{
if($line =~ /Package/)
{
$line =~ s/Package: //;
print ONT ' <package rdf:ID="instance'.$i.'">';
print ONT ' <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</name>'."\n";
}
elsif($line =~ /Priority/)
{
$line =~ s/Priority: //;
print ONT ' <priority rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</priority>'."\n";
}
elsif($line =~ /Section/)
{
$line =~ s/Section: //;
print ONT ' <Section rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Section>'."\n";
}
elsif($line =~ /Maintainer/)
{
$line =~ s/Maintainer: //;
print ONT ' <maintainer rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</maintainer>'."\n";
}
elsif($line =~ /Architecture/)
{
$line =~ s/Architecture: //;
print ONT ' <architecture rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</architecture>'."\n";
}
elsif($line =~ /Version/)
{
$line =~ s/Version: //;
print ONT ' <version rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</version>'."\n";
}
elsif($line =~ /Provides/)
{
$line =~ s/Provides: //;
print ONT ' <provides rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</provides>'."\n";
}
elsif($line =~ /Depends/)
{
$line =~ s/Depends: //;
print ONT ' <depends rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</depends>'."\n";
}
elsif($line =~ /Suggests/)
{
$line =~ s/Suggests: //;
print ONT ' <suggests rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</suggests>'."\n";
}
elsif($line =~ /Description/)
{
$line =~ s/Description: //;
print ONT ' <Description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Description>'."\n";
}
elsif($line =~ /Tag/)
{
$line =~ s/Tag: //;
print ONT ' <Tag rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Tag>'."\n";
print ONT ' </Package>'."\n\n";
}
$i=$i+1;
}

my $desc = "Description:";
my $tag = "Tag:";
$line =~ /$desc(.*?)$tag/;
my $matched = $1;
print $matched;
or
my $desc = "Description:";
my $tag = "Tag:";
my @matched = $line =~ /$desc(.*?)$tag/;
print $matched[0];
or
my $desc = "Description:";
my $tag = "Tag:";
(my $matched = $line) =~ s/$desc(.*?)$tag/$1/;
print $matched;
Additional
If your Description and Tag may be on separate lines, you may need to use the /s modifier, to treat it as a single line, so the \n won't wreck it. Example:
$_=qq{Description:foo
more description on
new line Tag: some
tag};
s/Description:(.*?)Tag:/$1/s; # note the trailing /s modifier
print;
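The same multi-line match can be wrapped in a small function that returns the captured text instead of modifying the input. A sketch only; between is a hypothetical name, and the markers are treated as literal strings:

```perl
use strict;
use warnings;

# Return the text between two literal markers, or undef if they are absent.
sub between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes the markers; /s lets . match newlines.
    return $text =~ /\Q$start\E\s*(.*?)\s*\Q$end\E/s ? $1 : undef;
}

my $record = "Description: foo\n more description on\n a new line\nTag: some tag\n";
my $description = between($record, 'Description:', 'Tag:');
print "$description\n" if defined $description;
```

A plain string is enough to hold the result; split it on newlines afterwards if a list of lines is more convenient.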

Assuming:
my $example; # holds the example text above
You could:
(my $result=$example)=~s/^.*?\n(Description:)/$1/s; # strip up to first marker
$result=~s/(\nTag:[^\n]*\n).+$/$1/s; # strip everything after second marker line
Or
(my $result=$example)=~s/^.*?\n(Description:.+?Tag:[^\n]*\n).*$/$1/s;
Both assume the Tag: value is contained on a single line.
If this is not the case, you might try:
(my $result=$example)=~s/
( # start capture
Description: # literal 'Description:'
.+? # any chars (non-greedy) up to
Tag: # literal 'Tag:'
.+? # any chars up to
)
(?: # either
\n[A-Z][a-z]+\: # another tagged value name
| # or
$ # end of string
)
/$1/sx;

I believe that the problem is caused by using a line-reading loop for data structured by paragraphs. If you can slurp the file into memory and apply split with a captured delimiter, the processing will be much smoother:
#!/usr/bin/perl -w
use strict;
use diagnostics;
use warnings;
use English;
# simple sample sub
my $printhead = sub {
printf "%5s got the tag '%s ...'\n", '', substr( shift, 0, 30 );
};
# map keys/tags? to functions
my %tagsoups = (
'PackageName' => sub {printf "%5s got the name '%s'\n", '', shift;}
, 'Description' => sub {printf "%5s got the description:\n---------\n%s\n----------\n", '', shift;}
, 'Tag' => $printhead
);
# slurp Packages (fallback: parse using $INPUT_RECORD_SEPARATOR = "Package:")
open my $fh, "<", './Packages-00.txt' or die $!;
local $/; # enable localized slurp mode
my $all = <$fh>;
my @pks = split /^(Package):\s+/ms, $all;
close $fh;
# outer loop: Packages
for (my $p = 1, my $n = 0; $p < scalar @pks; $p +=2) {
my $blk = "PackageName: " . $pks[$p + 1];
my @inf = split /\s*^([\w-]+):\s+/ms, $blk;
printf "%3d %s named %s\n", ++$n, $pks[$p], $inf[ 2 ];
# inner loop: key-value-pairs (or whatever they are called)
for (my $x = 1; $x < scalar @inf; $x += 2) {
if (exists($tagsoups{$inf[ $x ]})) {
$tagsoups{$inf[ $x ]}($inf[$x + 1]);
}
}
}
Output for a shortened Packages file from my Ubuntu Linux:
3 Package named abrowser-3.5-branding
got the PackageName:
---------
abrowser-3.5-branding
----------
got the Description:
---------
dummy upgrade package for firefox-3.5 -> firefox
This is a transitional package so firefox-3.5 users get firefox on
upgrades. It can be safely removed.
----------
4 Package named casper
got the PackageName:
---------
casper
----------
got the Description:
---------
Run a "live" preinstalled system from read-only media
----------
got the Tag:
---------
admin::boot, admin::filesystem, implemented-in::shell, protocol::smb, role::plugin, scope::utility, special::completely-tagged, works-with-format::iso9660
----------
Using a hash for the functions to apply to the extracted parts will keep the details of generating xml out of the parser loops.
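A related way to handle paragraph-structured data is to parse each record into a hash of fields first and generate the output afterwards. When reading from a file, setting $/ to the empty string makes each read return one blank-line-separated record. The sketch below covers only the per-record parsing; parse_record is a hypothetical name, the sample record is made up, and continuation lines are assumed to start with whitespace:

```perl
use strict;
use warnings;

# Parse one "Field: value" record into a hash; indented lines continue
# the previous field (an assumption about the input layout).
sub parse_record {
    my ($record) = @_;
    my (%field, $current);
    for my $line (split /\n/, $record) {
        if ($line =~ /^([\w-]+):\s*(.*)$/) {
            $current = $1;
            $field{$current} = $2;
        }
        elsif (defined $current && $line =~ /^\s+(.*)$/) {
            $field{$current} .= "\n$1";
        }
    }
    return \%field;
}

# Hypothetical sample record; "example-dev" is not a real package name.
my $record = join "\n",
    'Package: example-dev',
    'Description: library for decoding ATSC A/52 streams (development)',
    ' This package contains the development files.',
    'Homepage: http://liba52.sourceforge.net/',
    'Tag: devel::library, role::devel-lib';

my $pkg = parse_record($record);
print "$_: $pkg->{$_}\n" for sort keys %$pkg;
```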

Masking a string in perl using a mask string

I have a string such as 'xxox-x' that I want to mask each line in a file against as such:
x's are ignored (or just set to a known value)
o's remain unchanged
the - is a variable length field that will keep everything else unchanged
therefore mask 'xxox-x' against 'deadbeef' would yield 'xxaxbeex'
the same mask 'xxox-x' against 'deadabbabeef' would yield 'xxaxabbabeex'
How can I do this succinctly, preferably using the s operator?

$mask =~ s/-/'o' x (length($str) - length($mask) + 1)/e;
$str =~ s/(.)/substr($mask, pos $str, 1) eq 'o' ? $1 : 'x'/eg;
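The same idea can be wrapped in a function. A sketch, with mask_string as a hypothetical name; it puts explicit parentheses around each length() call because named unary operators bind more loosely than subtraction, and it reads the match offset from $-[0] rather than pos. The mask is assumed to contain at most one dash:

```perl
use strict;
use warnings;

sub mask_string {
    my ($mask, $str) = @_;
    # Expand '-' into enough 'o' characters to make the mask as long as the string.
    $mask =~ s/-/'o' x (length($str) - length($mask) + 1)/e;
    # $-[0] is the offset of the current match, i.e. the character's index.
    $str =~ s/(.)/substr($mask, $-[0], 1) eq 'o' ? $1 : 'x'/seg;
    return $str;
}

print mask_string('xxox-x', 'deadbeef'), "\n";       # xxaxbeex
print mask_string('xxox-x', 'deadabbabeef'), "\n";   # xxaxabbabeex
```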

$ perl -pe 's/^..(.).(.+).$/xx$1x$2x/;'
deadbeef
xxaxbeex
deadabbabeef
xxaxabbabeex

Compile your pattern into a Perl sub:
sub compile {
use feature 'switch';
my($pattern) = @_;
die "illegal pattern" unless $pattern =~ /^[-xo]+$/;
my($search,$replace);
my $i = 0;
for (split //, $pattern) {
given ($_) {
when ("x") {
$search .= "."; $replace .= "x";
}
when ("o") {
$search .= "(?<sub$i>.)";
$replace .= "\$+{sub$i}";
++$i;
}
when ("-") {
$search .= "(?<sub$i>.*)";
$replace .= "\$+{sub$i}";
++$i;
}
}
}
my $code = q{
sub {
local($_) = @_;
s/^SEARCH$/REPLACE/s;
$_;
}
};
$code =~ s/SEARCH/$search/;
$code =~ s/REPLACE/$replace/;
#print $code;
local $@;
my $sub = eval $code;
die $@ if $@;
$sub;
}
To be more concise, you could write
sub _patref { '$+{sub' . $_[0]++ . '}' }
sub compile {
my($pattern) = @_;
die "illegal pattern" unless $pattern =~ /^[-xo]+$/;
my %gen = (
'x' => sub { $_[1] .= '.'; $_[2] .= 'x' },
'o' => sub { $_[1] .= "(?<sub$_[0]>.)"; $_[2] .= &_patref },
'-' => sub { $_[1] .= "(?<sub$_[0]>.*)"; $_[2] .= &_patref },
);
my($i,$search,$replace) = (0,"","");
$gen{$1}->($i,$search,$replace)
while $pattern =~ /(.)/g;
eval "sub { local(\$_) = \@_; s/\\A$search\\z/$replace/; \$_ }"
or die $@;
}
Testing it:
use v5.10;
my $replace = compile "xxox-x";
my @tests = (
[ deadbeef => "xxaxbeex" ],
[ deadabbabeef => "xxaxabbabeex" ],
);
for (@tests) {
my($input,$expect) = @$_;
my $got = $replace->($input);
print "$input => $got : ", ($got eq $expect ? "PASS" : "FAIL"), "\n";
}
Output:
deadbeef => xxaxbeex : PASS
deadabbabeef => xxaxabbabeex : PASS
Note that you'll need Perl 5.10.x for given ... when.

x can be translated to . and o to (.) whereas - becomes (.+?):
#!/usr/bin/perl
use strict; use warnings;
my %s = qw( deadbeef xxaxbeex deadabbabeef xxaxabbabeex);
for my $k ( keys %s ) {
(my $x = $k) =~ s/^..(.).(.+?).\z/xx$1x$2x/;
print +($x eq $s{$k} ? 'good' : 'bad'), "\n";
}

Here's a quick stab at a regex generator. Maybe somebody can refactor something pretty from it?
#!/usr/bin/perl
use strict;
use Test::Most qw( no_plan );
my $mask = 'xxox-x';
is( mask( $mask, 'deadbeef' ), 'xxaxbeex' );
is( mask( $mask, 'deadabbabeef' ), 'xxaxabbabeex' );
sub mask {
my ($mask, $string) = @_;
my $regex = $mask;
my $capture_index = 1;
my $mask_rules = {
'x' => '.',
'o' => '(.)',
'-' => '(.+)',
};
$regex =~ s/$_/$mask_rules->{$_}/g for keys %$mask_rules;
$mask =~ s/$_/$mask_rules->{$_}/g for keys %$mask_rules;
$mask =~ s/\./x/g;
$mask =~ s/\([^)]+\)/'$' . $capture_index++/eg;
eval " \$string =~ s/^$regex\$/$mask/ ";
$string;
}

Here's a character-by-character solution using substr rather than split. It should be efficient for long strings since it skips processing the middle part of the string (when there is a dash).
sub apply_mask {
my $mask = shift;
my $string = shift;
my ($head, $tail) = split /-/, $mask;
for( 0 .. length($head) - 1 ) {
my $m = substr $head, $_, 1;
next if $m eq 'o';
die "Bad char $m\n" if $m ne 'x';
substr($string, $_, 1) = 'x';
}
return $string unless defined $tail;
$tail = reverse $tail;
my $last_char = length($string) - 1;
for( 0 .. length($tail) - 1 ) {
my $m = substr $tail, $_, 1;
next if $m eq 'o';
die "Bad char $m\n" if $m ne 'x';
substr($string, $last_char - $_, 1) = 'x';
}
return $string;
}

sub mask {
local $_ = $_[0];
my $mask = $_[1];
$mask =~ s/-/'o' x (length($_)-(length($mask)-1))/e;
s/(.)/substr($mask, pos, 1) eq 'o' && $1/eg;
return $_;
}
Used tidbits from a couple answers ... this is what I ended up with.
EDIT: update from comments
