Perl regex syntax generation - regex

This is a follow-up to the question posted here: Perl Regex syntax
That discussion produced this script:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines = <DATA>;
my $current_label = '';
my @ordered_labels;
my %data;
for my $line (@lines) {
    if ( $line =~ /^\/(.*)$/ ) { # starts with slash
        $current_label = $1;
        push @ordered_labels, $current_label;
        next;
    }
    if ( length $current_label ) {
        if ( $line =~ /^(\d) "(.*)"$/ ) {
            $data{$current_label}{$1} = $2;
            next;
        }
    }
}
for my $label ( @ordered_labels ) {
    print "$label <- as.factor($label\n";
    print " , levels= c(";
    print join(',',map { $_ } sort keys %{$data{$label}} );
    print ")\n";
    print " , labels= c(";
    print join(',',
        map { '"' . $data{$label}{$_} . '"' }
        sort keys %{$data{$label}} );
    print ")\n";
    print " )\n";
}
__DATA__
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
execute .
Essentially, I need to build the Perl to accommodate a syntax shorthand found in the SPSS file. For adjacent columns, SPSS allows one to type something like:
VALUE LABELS
/agree1 to agree5
1 "Strongly disagree"
2 "Disagree"
3 "Neutral"
4 "Agree"
5 "Strongly agree"
As the script currently exists, it will generate this:
agree1 to agree5 <- factor(agree1 to agree5
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
and I need it to produce something like this:
agree1 <- factor(agree1
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
agree2 <- factor(agree2
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
…

use strict;
use warnings;
main();
sub main {
    my @lines = <DATA>;
    my $vlabels = get_value_labels(@lines);
    write_output_delim($vlabels);
}
# Extract the value label information from SPSS syntax.
sub get_value_labels {
    my (@vlabels, $i, $j);
    for my $line (@_){
        if ( $line =~ /^\/(.+)/ ){
            my @vars = parse_var_range($1);
            $i = @vlabels;
            $j = $i + @vars - 1;
            push @vlabels, { var => $_, codes => [] } for @vars;
        }
        elsif ( $line =~ /^\s* (\d) \s+ "(.*)"$/x ){
            push @{$vlabels[$_]{codes}}, [$1, $2] for $i .. $j;
        }
    }
    return \@vlabels;
}
# A helper function to handle variable ranges: "agree1 to agree3".
sub parse_var_range {
    my $vr = shift;
    my @vars = split /\s+ to \s+/x, $vr;
    return $vr unless @vars > 1;
    my ($stem) = $vars[0] =~ /(.+?)\d+$/;
    my @n = map { /(\d+)$/ } @vars;
    return map { "$stem" . $_ } $n[0] .. $n[1];
}
sub write_output_delim {
    my $vlabels = shift;
    for my $vlab (@$vlabels){
        print $vlab->{var}, "\n";
        print join("\t", '', @$_), "\n" for @{$vlab->{codes}}
    }
}
sub write_output_factors {
    # You get the idea...
}
__DATA__
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
/agree1 to agree3
1 "Disagree"
2 "Neutral"
3 "Agree"
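The write_output_factors stub in the answer is left as an exercise. One possible way to fill it in is sketched below, assuming the data structure returned by get_value_labels (an array of { var => ..., codes => [ [code, label], ... ] } hashes). The helper name factor_statement is hypothetical, and labels are not escaped, so a label containing a double quote would need extra handling.

```perl
use strict;
use warnings;

# Build the R statement for one variable.
# $vlab is { var => NAME, codes => [ [code, label], ... ] }.
sub factor_statement {
    my ($vlab) = @_;
    my $var    = $vlab->{var};
    my @codes  = @{ $vlab->{codes} };
    my $levels = join ',', map { $_->[0] } @codes;
    my $labels = join ',', map { qq("$_->[1]") } @codes;
    return "$var <- factor($var\n"
         . " , levels= c($levels)\n"
         . " , labels= c($labels)\n"
         . " )\n";
}

# Print one factor() statement per variable.
sub write_output_factors {
    my ($vlabels) = @_;
    print factor_statement($_) for @$vlabels;
}

# Hand-built sample in the shape get_value_labels is assumed to return.
write_output_factors([
    { var => 'agree1', codes => [ [1, 'Disagree'], [2, 'Neutral'], [3, 'Agree'] ] },
    { var => 'agree2', codes => [ [1, 'Disagree'], [2, 'Neutral'], [3, 'Agree'] ] },
]);
```

Because parse_var_range has already expanded "agree1 to agree3" into separate entries, the output routine needs no knowledge of ranges.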

Related

Read all the lines between a pattern match using Perl

I want to read a CSV file. The file looks like this:
cat,dog,0
apple,banana,0
#start_loop
jug,ball,0
tub, jar,3
#stop_loop
phone,bottle,10
#per head
#start_loop
cup,book,7
laptop,charger,9
#stop_loop
For the CSV above, I want to read the lines between #start_loop and #stop_loop. I also want to keep the two #start_loop blocks apart: for example, the lines after the first #start_loop go into one array and the lines after the second #start_loop go into another.
The expected result would be:
@array1 = {jug ball 0, tub jar 3}
and
@array2 = { cup book 7, laptop charger 9}
How do I solve this problem?
You can use the flip-flop operator. Its value will be 1 for the #start_loop line and will end in E0 for the #stop_loop line.
#!/usr/bin/perl
use warnings;
use strict;
my @arr;
while (<>) {
    chomp;
    my $inside = /^#start_loop/ .. /^#stop_loop/;
    if ($inside) {
        push @arr, [] if 1 == $inside;
        undef $inside if 1 == $inside
                      || $inside =~ /E/;
        push @{ $arr[-1] }, $_ if $inside;
    }
}
use Data::Dumper; print Dumper \@arr;
Simplified version of choroba's solution:
my @arr;
while (<>) {
    chomp;
    my $inside = /^#start_loop/ .. /^#stop_loop/;
    next unless $inside;                                    # Outside any block
    if    ( $inside == 1   ) { push @arr, []; }             # Start line
    elsif ( $inside !~ /E/ ) { push @{ $arr[-1] }, $_; }    # Middle line
}
Without the flip-flop:
my @arr;
my $inside;
while (<>) {
    chomp;
    if    ( /^#start_loop/ ) { $inside = 1; push @arr, []; }    # Start line
    elsif ( /^#stop_loop/  ) { $inside = 0; }                   # Stop line
    elsif ( $inside        ) { push @{ $arr[-1] }, $_; }        # Middle line
}
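To see what the flip-flop solutions are testing for, it can help to print the value the operator returns on each line. The sketch below is only an illustration: trace_flip_flop is a hypothetical helper, and it works on an in-memory list rather than on a file.

```perl
use strict;
use warnings;

# Return one entry per input line: the flip-flop value while inside a
# #start_loop .. #stop_loop block, or '-' when outside of one.
sub trace_flip_flop {
    my @trace;
    for (@_) {
        my $inside = /^#start_loop/ .. /^#stop_loop/;
        push @trace, $inside ? $inside : '-';
    }
    return @trace;
}

my @sample = ( 'cat,dog,0', '#start_loop', 'jug,ball,0', 'tub, jar,3',
               '#stop_loop', 'phone,bottle,10' );
print join( ' ', trace_flip_flop(@sample) ), "\n";    # - 1 2 3 4E0 -
```

The start line yields 1, the lines in between yield 2, 3, and so on, and the stop line yields a sequence number with E0 appended, which is why the answers compare against 1 and match on /E/.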

How to place a regular expression in a perl map operation using the conditional operator

Hi, I am trying to shorten a loop construct using a map operation and a conditional operator with a regular expression, but I don't get the right results. I think the reason for the failure is the difference between the assignment operator and the regexp binding operator (= vs. =~). What is the right formulation for this approach?
Long version:
print("TABLE.BEGIN\n");
while(my @data = $sth->fetchrow_array) {
    $row++;
    for my $ix (0..$#data) {
        my $val = $data[$ix];
        if ($val) {
            $val =~ s/\t/[:TAB:]/g;
        } else {
            $val = 'NULL';
        }
        $data[$ix] = $val;
    }
    print "ROW: $row\t",join("\t",@data),"\n";
}
print("END.TABLE\n");
Long version result
TABLE.BEGIN
ROW: 1 12 79 1 PhoViComp 2017-05-22-PhoViComp 2017-05-22 32632 rostock HRO punjabi 2017-05-22-PhoViComp /net/server/path/
END.TABLE
Short version
print("TABLE.BEGIN\n");
my $row=0;
while(my @data = $sth->fetchrow_array) {
    $row++;
    @data = map { $_ ? s/\t/[:TAB:]/g : 'NULL' } @data;
    print "ROW: $row\t",join("\t",@data),"\n";
}
print("END.TABLE\n");
Short version result
TABLE.BEGIN
ROW: 1
END.TABLE
Used as the map expression, s/\t/[:TAB:]/g does not return the modified string: it returns the number of substitutions made, or a false value if nothing matched. You need to return the modified $_.
So you had something close to
for (@data) {
    my $val = $_;
    if (defined($val)) {
        $val =~ s/\t/[:TAB:]/g;
    } else {
        $val = 'NULL';
    }
    $_ = $val;
}
The map version would be
@data = map {
    my $val = $_;
    if (defined($val)) {
        $val =~ s/\t/[:TAB:]/g;
    } else {
        $val = 'NULL';
    }
    $val
} @data;
Of course, you could have written
$_ = defined($_) ? s/\t/[:TAB:]/rg : 'NULL' for @data; # 5.14+
The equivalent (but slower) map version would be
@data = map { defined($_) ? s/\t/[:TAB:]/rg : 'NULL' } @data; # 5.14+
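A small self-contained check of the /r variant, using an in-memory row instead of a database handle. The sub name clean_row and the sample values are made up for the illustration.

```perl
use strict;
use warnings;
use 5.014;    # s///r needs Perl 5.14 or later

# Return a cleaned copy of the row: tabs become [:TAB:], undef becomes NULL.
sub clean_row {
    return map { defined($_) ? s/\t/[:TAB:]/gr : 'NULL' } @_;
}

my @row = ( "a\tb", undef, 0, 'plain' );
print join( '|', clean_row(@row) ), "\n";    # a[:TAB:]b|NULL|0|plain
```

Testing with defined rather than plain truthiness means a legitimate 0 in a column is kept instead of being turned into NULL.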

Print out preceding lines after regex match

I am writing a sort of quiz program. I am using a .txt file as a test bank, but I can't figure out how to match each question (using regexes) and print the possible answers on separate lines. Originally I was only going to do true/false questions, so I didn't need to match anything else, and just matching "1" worked fine. Basically, I need the question on one line and the answers on the following lines. Here is an example of a question:
1.) some text
a.) solution
b.) solution
c.) solution
Code I had before:
while (<$test>) {
    foreach my $line (split /\n/) {
        my $match1 = "1";
        if ($line =~ /$match1/) {
            $question1 = $line;
            print "$question1\n";
            print "Answer: ";
            $answer1 = <>;
            chomp ($answer1);
            if ( $answer1 =~ /(^a$)/i) {
                $score1 = 20;
                push @score, $score1;
            }
        }
    }
}
I really couldn't get what you were getting at, so I wrote this sample program.
use 5.016;
use strict;
use warnings;
my ( @lines, @questions, $current_question );
sub prompt {
    my ( $prompt ) = @_;
    print $prompt, ' ';
    STDOUT->flush;
    my $val = <>;
    return $val;
}
QUESTION:
while ( <DATA> ) {
    if ( my ( $ans ) = m/^=(\w+)/ ) {
        INPUT: {
            say @lines;
            last unless defined( my $answer = prompt( 'Your answer:' ));
            say '';
            my ( $response ) = $answer =~ /([a-z])\s*$/;
            if ( not $response ) {
                $answer =~ s/\s*$//; #/
                say "Invalid response. '$answer' is not an answer!\n";
                redo INPUT;
            }
            if ( $response eq $ans ) {
                say 'You are right!';
            }
            elsif ( my $ansln = $current_question->{$response} ) {
                if ( $response eq 'q' ) {
                    say 'Quitting...';
                    last QUESTION;
                }
                say <<"END_SAY";
You chose:\n$current_question->{$response}
The correct answer was:\n$current_question->{$ans}
END_SAY
            }
            else {
                say "Invalid response. '$response' is not an answer!\n";
                redo INPUT;
            }
        };
        @lines = ();
        prompt( 'Press enter to continue.' );
        say '';
    }
    else {
        if ( my ( $qn, $q ) = m/^\s*(\d+)\.\)\s+(.*\S)\s*$/ ) {
            push @questions, $current_question = { question => $q };
        }
        else {
            # \s* rather than \s+ so that unindented choices match too
            my ( $l ) = m/^\s*([a-z])/;
            $current_question->{$l} = ( m/(.*)/ )[0];
        }
        push @lines, $_;
    }
}
__DATA__
1.) Perl is
a.) essential
b.) fun
c.) useful
=c
2.) This question is
a.) Number two
b.) A test to see how this format is parsed.
c.) Unneeded
=b
This is probably oversimplified.
It just reads in the test data and creates a structure.
You could use it to grade test takers' answers.
use strict;
use warnings;
use Data::Dumper;
$/ = undef;
my $testdata = <DATA>;
my %HashTest = ();
my $hchoices;
my $hqeustion;
my $is_question = 0;
while ( $testdata =~ /(^.*)\n/mg )
{
my $line = $1;
$line =~ s/^\s+|\s+$//g;
next if ( length( $line ) == 0);
if ( $line =~ /^(\d+)\s*\.\s*\)\s*(.*)/ )
{
$is_question = 1;
$HashTest{ $1 }{'question'} = $2;
$HashTest{ $1 }{'choices'} = {};
$HashTest{ $1 }{'answer'} = 'unknown';
$hqeustion = $HashTest{ $1 };
$hchoices = $HashTest{ $1 }{'choices'};
}
elsif ( $is_question && $line =~ /^\s*(answer)\s*:\s*([a-z])/ )
{
$hqeustion->{'answer'} = $2;
}
elsif ( $is_question && $line =~ /^\s*([a-z])\s*\.\s*\)\s*(.*)/ )
{
$hchoices->{ $1 } = $2;
}
}
print "\nQ & A summary\n-------------------------\n";
for my $qnum ( keys %HashTest )
{
print "Question $qnum: $HashTest{$qnum}{'question'}'\n";
my $ans_code = $HashTest{$qnum}{'answer'};
print "Answer: ($ans_code) $HashTest{$qnum}{'choices'}{$ans_code}\n\n";
}
print "---------------------------\n";
print Dumper(\%HashTest);
__DATA__
1.) What is the diameter of the earth?
a.) Half the distance to the sun
b.) Same as the moon
c.) 6,000 miles
answer: c
2.) Who is buried in Grants Tomb?
a.) Thomas Edison
b.) Grant, who else
c.) Jimi Hendrix
answer: b
Output:
Q & A summary
-------------------------
Question 1: What is the diameter of the earth?'
Answer: (c) 6,000 miles
Question 2: Who is buried in Grants Tomb?'
Answer: (b) Grant, who else
---------------------------
$VAR1 = {
'1' => {
'question' => 'What is the diameter of the earth?',
'answer' => 'c',
'choices' => {
'c' => '6,000 miles',
'a' => 'Half the distance to the sun',
'b' => 'Same as the moon'
}
},
'2' => {
'question' => 'Who is buried in Grants Tomb?',
'answer' => 'b',
'choices' => {
'c' => 'Jimi Hendrix',
'a' => 'Thomas Edison',
'b' => 'Grant, who else'
}
}
};

How to automagically create pattern based on real data?

I have many vendors in a database, and they all differ in some aspect of their data. I'd like to make a data validation rule that is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if a user inputs the string 'XZ-217' for vendor B, the algorithm should compare it with previous data and say: this string is not similar to vendor B's previous data.
Is there a good way or tool to achieve such a comparison? The answer could be a generic algorithm or a Perl module.
Edit:
"Similarity" is hard to define, I agree. But I'd like to find an algorithm that could analyze the previous 100 or so samples and then compare the outcome of that analysis with new data. Similarity may be based on length, on the use of characters/numbers, on string creation patterns, on a similar beginning/end/middle, or on the presence of separators.
I feel it is not an easy task, but on the other hand I think it has very wide use, so I hoped there would already be some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
    shift;    # discard the invocant
    # Accept either a list of profiles or one reference to such a list.
    my @profiles = @_ == 1 ? @{ $_[0] } : @_;
    my $str
        = join( '|'
              , map { join( ''
                          , grep { defined; }
                            map {
                                  $_->[0] eq 'P' ? quotemeta( $_->[1] )
                                : $_->[0] eq 'W' ? "\\w{$_->[1],$_->[2]}"
                                : $_->[0] eq 'D' ? "\\d{$_->[1],$_->[2]}"
                                :                  undef
                                ;
                            } @$_
                          )
                    } @profiles
              );
    return qr/^(?:$str)$/;
}
sub merge_profiles {
    shift;    # discard the invocant
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) {
        my $profile_length = @$profile;
        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) {
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            if ( $old->[0] eq 'P' ) {
                # punctuation chunks carry a literal, not a length range,
                # so they only merge when the literal is identical
                next PROFILE unless $old->[1] eq $new->[1];
                push @merged, $old;
                next;
            }
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] )
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}
sub compute_info_profile {
    shift;    # discard the invocant
    my @profile_chunks
        = map {
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length($_), length($_) ]
            :        [ D => length($_), length($_) ]
          }
          grep { length; } split /(\W+)/, shift
        ;
    return \@profile_chunks;
}
# Pseudo-Perl: the $application methods, Incident, INVALID_FORMAT and
# $customer are placeholders, so this sub will not compile as written.
sub process_input_task {
    my ( $application, $input ) = @_;
    my $patterns = $application->get_patterns_for_current_customer;
    my $regex    = $application->compile_search_expr( $patterns );
    if ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        $application->merge_profiles( $patterns, $application->compute_info_profile( $input ));
    }
    else {
        $application->escalate(
            Incident->new( issue    => INVALID_FORMAT
                         , input    => $input
                         , customer => $customer
                         ));
    }
    return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
    A => [qw/ XZ-4 XZ-23 XZ-217 /],
    B => [qw/ 1276 1899 22711 /],
    C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
    print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
    my @cases = @_;
    my %exploded;
    # ex. $case may be XZ-217
    foreach my $case (@cases) {
        my @parts =
            grep { defined and length }
            split( /(\d+|\w+)/, $case );
        # @parts are ( XZ, -, 217 )
        foreach (@parts) {
            if (/\d/) {
                # 217 becomes ['\d' => 3]
                push @{ $exploded{$case} }, ['\d' => length];
            } elsif (/\w/) {
                # XZ becomes ['\w' => 2]
                push @{ $exploded{$case} }, ['\w' => length];
            } else {
                # - becomes ['lit' => '-']
                push @{ $exploded{$case} }, ['lit' => $_ ];
            }
        }
    }
    my $pattern = '';
    # iterate over nth element (part) of each case
    my $ea = each_arrayref(values %exploded);
    while (my @parts = $ea->()) {
        # remove undefined (i.e. optional) parts
        my @def_parts = grep { defined } @parts;
        # check that all (defined) parts are the same type
        my @part_types = uniq map {$_->[0]} @def_parts;
        if (@part_types > 1) {
            warn "Parts not aligned\n";
            return;
        }
        my $type = $part_types[0]; #same so make scalar
        # were there optional parts?
        my $required = (@parts == @def_parts);
        # keep the values of each part
        # these are either a repetition count or lit strings
        my @values = uniq map { $_->[1] } @def_parts;
        # these are for non-literal quantifiers; sort the lengths numerically
        # (a bare "sort uniq ..." would treat uniq as the comparison sub)
        my ($min, $max) = (0, 0);
        if ($type ne 'lit') {
            my @lengths = sort { $a <=> $b } @values;
            $min = $required ? $lengths[0] : 0;
            $max = $lengths[-1];
        }
        # write the specific pattern for each type
        if ($type eq '\d') {
            $pattern .= '\d' . "{$min,$max}";
        } elsif ($type eq '\w') {
            $pattern .= '\w' . "{$min,$max}";
        } elsif ($type eq 'lit') {
            # quote special characters, - becomes \-
            my @uniq = map { quotemeta } sort @values;
            # join with alternations, surround by non-capture group, add quantifier
            $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
        }
    }
    # build the qr regex from pattern
    my $regex = qr/$pattern/;
    # test that all original patterns match (@fail should be empty)
    my @fail = grep { $_ !~ $regex } @cases;
    if (@fail) {
        warn "Some cases fail for generated pattern $regex: (@fail)\n";
        return '';
    } else {
        return $regex;
    }
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
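As a rough illustration of how such generated patterns could be used for the validation described in the question, the sketch below hard-codes the three patterns from the sample output and anchors them when testing new input. The names %vendor_pattern and looks_like_vendor are hypothetical; the anchoring with \A and \z is an addition, since the generated qr// objects are unanchored.

```perl
use strict;
use warnings;

# Patterns copied from the sample output above, one per vendor.
my %vendor_pattern = (
    A => qr/\w{2,2}(?:\-){1}\d{1,3}/,
    B => qr/\d{4,5}/,
    C => qr/\d{2,2}(?:\-)?\d{0,2}/,
);

# Return 1 if $input has the shape learned for $vendor, 0 if not,
# and nothing if the vendor is unknown.
sub looks_like_vendor {
    my ( $vendor, $input ) = @_;
    my $re = $vendor_pattern{$vendor} or return;
    return $input =~ /\A$re\z/ ? 1 : 0;
}

for my $try ( [ B => 'XZ-217' ], [ A => 'XZ-217' ], [ B => '1899' ] ) {
    my ( $vendor, $input ) = @$try;
    printf "%s for vendor %s: %s\n", $input, $vendor,
        looks_like_vendor( $vendor, $input ) ? 'similar' : 'not similar';
}
```

Without the anchors, a six-digit value such as 127600 would still match vendor B's \d{4,5}, because the regex would be satisfied by a substring.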
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.

Masking a string in perl using a mask string

I have a string such as 'xxox-x' that I want to mask each line in a file against as such:
x's are ignored (or just set to a known value)
o's remain unchanged
the - is a variable length field that will keep everything else unchanged
therefore mask 'xxox-x' against 'deadbeef' would yield 'xxaxbeex'
the same mask 'xxox-x' against 'deadabbabeef' would yield 'xxaxabbabeex'
How can I do this succinctly, preferably using the s operator?
$mask =~ s/-/'o' x (length($str) - length($mask) + 1)/e;
$str =~ s/(.)/substr($mask, pos($str), 1) eq 'o' ? $1 : 'x'/eg;
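The length arithmetic behind the two-line answer above can be spelled out step by step. The sketch below is a longer, loop-based variant of the same idea, not the answerer's code; expand_mask and apply_expanded_mask are hypothetical names. The '-' occupies one character of the mask, so it must grow to length(string) - (length(mask) - 1) copies of 'o'.

```perl
use strict;
use warnings;

# Expand '-' into enough 'o's that the mask is as long as $str.
sub expand_mask {
    my ( $mask, $str ) = @_;
    my $fill = length($str) - ( length($mask) - 1 );
    die "string too short for mask\n" if $fill < 0;
    $mask =~ s/-/'o' x $fill/e;
    return $mask;
}

# Keep the characters under 'o' and write 'x' everywhere else.
sub apply_expanded_mask {
    my ( $mask, $str ) = @_;
    my $expanded = expand_mask( $mask, $str );
    my $out = '';
    for my $i ( 0 .. length($str) - 1 ) {
        $out .= substr( $expanded, $i, 1 ) eq 'o' ? substr( $str, $i, 1 ) : 'x';
    }
    return $out;
}

print expand_mask( 'xxox-x', 'deadbeef' ), "\n";            # xxoxooox
print apply_expanded_mask( 'xxox-x', 'deadbeef' ), "\n";    # xxaxbeex
```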
$ perl -pe 's/^..(.).(.+).$/xx$1x$2x/;'
deadbeef
xxaxbeex
deadabbabeef
xxaxabbabeex
Compile your pattern into a Perl sub:
sub compile {
    use feature 'switch';
    my($pattern) = @_;
    die "illegal pattern" unless $pattern =~ /^[-xo]+$/;
    my($search,$replace);
    my $i = 0;
    for (split //, $pattern) {
        given ($_) {
            when ("x") {
                $search .= "."; $replace .= "x";
            }
            when ("o") {
                $search .= "(?<sub$i>.)";
                $replace .= "\$+{sub$i}";
                ++$i;
            }
            when ("-") {
                $search .= "(?<sub$i>.*)";
                $replace .= "\$+{sub$i}";
                ++$i;
            }
        }
    }
    my $code = q{
        sub {
            local($_) = @_;
            s/^SEARCH$/REPLACE/s;
            $_;
        }
    };
    $code =~ s/SEARCH/$search/;
    $code =~ s/REPLACE/$replace/;
    #print $code;
    local $@;
    my $sub = eval $code;
    die $@ if $@;
    $sub;
}
To be more concise, you could write
sub _patref { '$+{sub' . $_[0]++ . '}' }
sub compile {
    my($pattern) = @_;
    die "illegal pattern" unless $pattern =~ /^[-xo]+$/;
    my %gen = (
        'x' => sub { $_[1] .= '.'; $_[2] .= 'x' },
        'o' => sub { $_[1] .= "(?<sub$_[0]>.)"; $_[2] .= &_patref },
        '-' => sub { $_[1] .= "(?<sub$_[0]>.*)"; $_[2] .= &_patref },
    );
    my($i,$search,$replace) = (0,"","");
    $gen{$1}->($i,$search,$replace)
        while $pattern =~ /(.)/g;
    eval "sub { local(\$_) = \@_; s/\\A$search\\z/$replace/; \$_ }"
        or die $@;
}
Testing it:
use v5.10;
my $replace = compile "xxox-x";
my @tests = (
    [ deadbeef => "xxaxbeex" ],
    [ deadabbabeef => "xxaxabbabeex" ],
);
for (@tests) {
    my($input,$expect) = @$_;
    my $got = $replace->($input);
    print "$input => $got : ", ($got eq $expect ? "PASS" : "FAIL"), "\n";
}
Output:
deadbeef => xxaxbeex : PASS
deadabbabeef => xxaxabbabeex : PASS
Note that you'll need Perl 5.10 or later for given ... when, and that Perl 5.18 and later flag it as experimental.
x can be translated to . and o to (.) whereas - becomes (.+?):
#!/usr/bin/perl
use strict; use warnings;
my %s = qw( deadbeef xxaxbeex deadabbabeef xxaxabbabeex);
for my $k ( keys %s ) {
(my $x = $k) =~ s/^..(.).(.+?).\z/xx$1x$2x/;
print +($x eq $s{$k} ? 'good' : 'bad'), "\n";
}
Here's a quick stab at a regex generator. Maybe somebody can refactor something pretty from it?
#!/usr/bin/perl
use strict;
use Test::Most qw( no_plan );
my $mask = 'xxox-x';
is( mask( $mask, 'deadbeef' ), 'xxaxbeex' );
is( mask( $mask, 'deadabbabeef' ), 'xxaxabbabeex' );
sub mask {
my ($mask, $string) = @_;
my $regex = $mask;
my $capture_index = 1;
my $mask_rules = {
'x' => '.',
'o' => '(.)',
'-' => '(.+)',
};
$regex =~ s/$_/$mask_rules->{$_}/g for keys %$mask_rules;
$mask =~ s/$_/$mask_rules->{$_}/g for keys %$mask_rules;
$mask =~ s/\./x/g;
$mask =~ s/\([^)]+\)/'$' . $capture_index++/eg;
eval " \$string =~ s/^$regex\$/$mask/ ";
$string;
}
Here's a character-by-character solution using substr rather than split. It should be efficient for long strings since it skips processing the middle part of the string (when there is a dash).
sub apply_mask {
my $mask = shift;
my $string = shift;
my ($head, $tail) = split /-/, $mask;
for( 0 .. length($head) - 1 ) {
my $m = substr $head, $_, 1;
next if $m eq 'o';
die "Bad char $m\n" if $m ne 'x';
substr($string, $_, 1) = 'x';
}
return $string unless defined $tail;
$tail = reverse $tail;
my $last_char = length($string) - 1;
for( 0 .. length($tail) - 1 ) {
my $m = substr $tail, $_, 1;
next if $m eq 'o';
die "Bad char $m\n" if $m ne 'x';
substr($string, $last_char - $_, 1) = 'x';
}
return $string;
}
sub mask {
local $_ = $_[0];
my $mask = $_[1];
$mask =~ s/-/'o' x (length($_)-(length($mask)-1))/e;
s/(.)/substr($mask, pos, 1) eq 'o' && $1/eg;
return $_;
}
Used tidbits from a couple answers ... this is what I ended up with.
EDIT: update from comments