Perl: pattern matching to compare pairs of strings

I have a problem counting the majority selected result for each pair of strings. My code, shown below, seems to count only whether the user chose sysA, sysB, or both, without taking the pair of strings into account. I also have trouble making multiple comparisons and handling the 7 users for each pair.
while ( $file = <INFILE> ) {
    @field = parse_csv($file);
    chomp(@field);
    @query = $field[1];
    for ( $i = 0; $i < @query; ++$i ) {
        if ( ( $field[2] eq $method ) || ( $field[3] eq $method ) ) {
            if ( $field[4] eq $field[2] ) {
                print "$query[$i]: $field[2], $field[3], $field[4]\n";
                $counta++;
            }
            if ( $field[4] eq $field[3] ) {
                print "$query[$i]: $field[2], $field[3]: $field[4]\n";
                $countb++;
            }
            if ( $field[4] eq ( $field[2] && $field[3] ) ) {
                #print "$query[$i]: $field[2]$field[3]\n";
                $countc++;
            }
        }
    }
}
Data: for each query, I have 3 different combinations of strings to compare, each in both orders.
comparison("lucene-std-rel","lucene-noLen-rr");
comparison("lucene-noLen-rr","lucene-std-rel");
comparison("lucene-noLen-rr","random");
comparison("random","lucene-noLen-rr");
comparison("lucene-noLen-rr","lucene-nolen-rel");
comparison("lucene-nolen-rel","lucene-noLen-rr");
Example data for one pair (7 users evaluate each pair):
user1,male,lucene-std-rel,random,lucene-std-relrandom
user2,male,lucene-std-rel,random,lucene-std-rel
user3,male,lucene-std-rel,random,lucene-std-rel
user4,male,lucene-std-rel,random,lucene-std-rel
user5,male,lucene-std-rel,random,lucene-std-relrandom
user6,male,lucene-std-rel,random,lucene-std-rel
user7,male,lucene-std-rel,random,lucene-std-rel
Example of the required output:
query 1: male fitness models
lucene-std-rel:5, random:0, both:2 ---> majority:lucene-std-rel
Any help is very much appreciated.

Well, without making this more complex than you requested, here is what I came up with as a possible approach.
#!/usr/bin/perl
use strict;
my %counter = ( "A" => 0, "B" => 0, "AB" => 0, "majority" => 0);
while(<DATA>){
chomp;
next unless $_;
my ($workerId,$query,$sys1,$sys2,$resultSelected) = split(',');
$counter{$resultSelected}++;
}
$counter{'majority'} = (sort {$counter{$b} <=> $counter{$a}} keys %counter)[0];
print "A: $counter{'A'} B: $counter{'B'} both(AB): $counter{'AB'} majority: $counter{'majority'}\n";
__END__
user1,male,A,B,A
user2,male,A,B,AB
user3,male,A,B,B
user4,male,A,B,A
user5,male,A,B,A
The output of this is:
A: 3 B: 1 both(AB): 1 majority: A
I don't feel like my example to you fully addresses the idea of there being more than one type with the "majority". For instance, if both A and B are 9, I'd expect them both to be listed there. I didn't bother to do that since you didn't ask, but hopefully this will get you along the right path.
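To get closer to the data in the question, here is a minimal sketch that groups the rows by query and system pair and reports every option that reaches the top count, so ties show up. It assumes the five-column layout from the question (user, query, sys1, sys2, selected) and that a "both" vote is stored as the two system names run together; the sub name tally is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally the selected result per (query, sys1, sys2) group.
# Input: CSV rows "user,query,sys1,sys2,selected"; returns one report line per group.
sub tally {
    my @rows = @_;
    my ( %count, @order );
    for my $row (@rows) {
        chomp $row;
        next unless length $row;
        my ( undef, $query, $sys1, $sys2, $picked ) = split /,/, $row;
        my $key = join "\t", $query, $sys1, $sys2;
        if ( !$count{$key} ) {
            push @order, $key;    # remember first-seen order of the groups
            $count{$key} = { $sys1 => 0, $sys2 => 0, both => 0 };
        }
        # Assumption: "both" is stored as the two names run together.
        my $vote = $picked eq "$sys1$sys2" ? 'both' : $picked;
        $count{$key}{$vote}++;
    }
    my @report;
    for my $key (@order) {
        my ( $query, $sys1, $sys2 ) = split /\t/, $key;
        my $c = $count{$key};
        my ($max) = sort { $b <=> $a } values %$c;
        # Keep every option that reaches the top count, so ties are visible.
        my @top = grep { $c->{$_} == $max } ( $sys1, $sys2, 'both' );
        push @report, sprintf "%s: %s:%d, %s:%d, both:%d ---> majority:%s",
            $query, $sys1, $c->{$sys1}, $sys2, $c->{$sys2}, $c->{both},
            join( '/', @top );
    }
    return @report;
}

my @rows = split /\n/, <<'END';
user1,male,lucene-std-rel,random,lucene-std-relrandom
user2,male,lucene-std-rel,random,lucene-std-rel
user3,male,lucene-std-rel,random,lucene-std-rel
user4,male,lucene-std-rel,random,lucene-std-rel
user5,male,lucene-std-rel,random,lucene-std-relrandom
user6,male,lucene-std-rel,random,lucene-std-rel
user7,male,lucene-std-rel,random,lucene-std-rel
END
print "$_\n" for tally(@rows);
# prints: male: lucene-std-rel:5, random:0, both:2 ---> majority:lucene-std-rel
```

Because the key includes both system names in order, ("A","B") and ("B","A") are counted as separate groups; sort the two names before building the key if they should be merged.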

open( INFILE, "compare.csv" ) or die("Can not open input file: $!");
while ( $file = <INFILE> ) {
    @field = parse_csv($file);
    chomp(@field);
    @query = $field[1];
    for ( $i = 0; $i < @query; ++$i ) {
        if ( ( $field[2] eq $method ) || ( $field[3] eq $method ) ) {
            if ( $field[4] eq $field[2] ) {
                print "$query[$i]: $field[2], $field[3], $field[4]\n";
                $counta++;
            }
            if ( $field[4] eq $field[3] ) {
                print "$query[$i]: $field[2], $field[3]: $field[4]\n";
                $countb++;
            }
            if ( $field[4] eq ( $field[2] && $field[3] ) ) {
                #print "$query[$i]: $field[2]$field[3]\n";
                $countc++;
            }
        }
    }
}
# Split one CSV line into fields, honouring double-quoted fields.
sub parse_csv {
    my $text = shift;
    my @new = ();
    push( @new, $+ ) while $text =~ m{
        "([^\"\\]*(?:\\.[^\"\\]*)*)",?
        | ([^,]+),?
        | ,
    }gx;
    push( @new, undef ) if substr( $text, -1, 1 ) eq ',';
    return @new;
}

Related

Generate pseudo-random list in Perl

I have a list with 79 entries that each look similar to this:
"YellowCircle1.png\tc\tColor"
That is, each entry has 3 elements (.png-file, a letter, and a category). The category can be color, number or shape.
I want to create a new list from this, pseudo-randomized. That is, I want to have all 79 entries in a random order, but with a limitation.
I have created a perl script for a completely random version using shuffle:
#!/usr/bin/perl
# Perl script to generate input list for E-Prime experiment
# with semi-randomized trials
# Date: 2020-12-30
# Open text file
$filename = 'output_shuffled.txt';
open($fh, '>', $filename) or die "Could not open file '$filename'";
# Generate headline
print $fh "Weight\tNested\tProcedure\tCardIMG1\tCardIMG3\tCardIMG4\tCardStim\tCorrectAnswer\tTrialType\n";
# Array with list of stimuli including corresponding correct response and trial type
@stimulus = (
"BlueCross1.png\tm\tColor",
"BlueCross2.png\tm\tColor",
"BlueStar1.png\tm\tColor",
"BlueStar3.png\tm\tColor",
"BlueTriangle2.png\tm\tColor",
"BlueTriangle3.png\tm\tColor",
"GreenCircle1.png\tv\tColor",
"GreenCircle3.png\tv\tColor",
"GreenCircle1.png\tv\tColor",
"GreenCircle3.png\tv\tColor",
"GreenCross1.png \tv\tColor",
"GreenCross4.png\tv\tColor",
"GreenTriangle3.png\tv\tColor",
"GreenTriangle4.png\tv\tColor",
"RedCircle2.png\tc\tColor",
"RedCircle3.png\tc\tColor",
"RedCross2.png\tc\tColor",
"RedCross4.png\tc\tColor",
"RedStar3.png\tc\tColor",
"RedStar4.png\tc\tColor",
"YellowCircle1.png\tn\tColor",
"YellowCircle2.png\tn\tColor",
"YellowStar1.png\tn\tColor",
"YellowTriangle2.png\tn\tColor",
"YellowTriangle4.png\tn\tColor",
"BlueCross1.png\tc\tNumber",
"BlueCross2.png\tv\tNumber",
"BlueStar1.png\tc\tNumber",
"BlueStar3.png\tn\tNumber",
"BlueTriangle2.png\tv\tNumber",
"GreenCircle1.png\tc\tNumber",
"GreenCircle3.png\tn\tNumber",
"BlueCross1.png\tm\tColor",
"BlueCross2.png\tm\tColor",
"BlueStar1.png\tm\tColor",
"BlueStar3.png\tm\tColor",
"BlueTriangle2.png\tv\tNumber",
"BlueTriangle3.png\tn\tNumber",
"GreenCircle1.png\tc\tNumber",
"GreenCircle3.png\tn\tNumber",
"GreenCross1.png\tc\tColor",
"GreenCross4.png\tm\tColor",
"GreenTriangle3.png\tn\tColor",
"GreenTriangle4.png\tm\tColor",
"RedCircle2.png\tv\tNumber",
"RedCircle3.png\tn\tNumber",
"RedCross2.png\tv\tNumber",
"RedCross4.png\tm\tNumber",
"RedStar3.png\tn\tColor",
"RedStar4.png\tm\tColor",
"YellowCircle1.png\tc\tColor",
"YellowCircle2.png\tv\tColor",
"YellowStar1.png\tc\tNumber",
"YellowStar4.png\tm\tNumber",
"YellowTriangle2.png\tv\tNumber",
"YellowTriangle4.png\tm\tNumber",
"BlueCross1.png\tn\tShape",
"BlueCross2.png\tn\tShape",
"BlueStar1.png\tv\tShape",
"BlueStar3.png\tv\tShape",
"BlueTriangle2.png\tc\tShape",
"BlueTriangle3.png\tc\tShape",
"GreenCircle1.png\tm\tShape",
"GreenCircle3.png\tm Shape",
"GreenCross1.png\tn\tShape",
"GreenCross4.png\tn\tShape",
"GreenTriangle3.png\tc\tShape",
"GreenTriangle4.png\tc\tShape",
"RedCircle2.png\tm\tShape",
"RedCircle3.png\tm\tShape",
"RedCross2.png\tn\tShape",
"RedCross4.png\tn\tShape",
"RedStar3.png\tv\tShape",
"RedStar4.png\tv\tShape",
"YellowCircle1.png\tm\tShape",
"YellowCircle2.png\tm\tShape",
"YellowStar1.png\tv\tShape",
"YellowStar4.png\tv\tShape",
"YellowTriangle2.png\tc\tShape",
"YellowTriangle4.png\tc\tShape",
);
# Shuffle --> Pick at random without double entries
use List::Util 'shuffle';
@shuffled = shuffle(@stimulus);
# Print each line with fixed values and shuffled stimulus entries to file
print $fh "1\t" . "\t" . "TrialProc\t" . "RedTriangle1.png\t" . "Greenstar2.png\t" . "YellowCross3.png\t" . "BlueCircle4.png\t" . "\t$_\n" for @shuffled;
# Close text file
close($fh);
# Print to terminal
print "Done\n";
However, what I eventually want is that the category does not switch more than once successively, but every 3 up to 5 times (randomly between these numbers). For example, if one line ends with "shape" and the following line with "color", the next line would have to be "color", because otherwise there would be 2 switches successively.
How would I create this? I suspect I would have to change the entries to something like hashes, so that I can create if-constructions based on the last element (that is "category") of each entry?

The solution, as you already guessed, is to split the data and reshuffle the parts that don't fit your rules.
Here is the code that does that.
# Shuffle --> Pick at random without double entries
use List::Util 'shuffle';
my @data = shuffle(map {[split("\t")]} @stimulus);
my (@result, %used);
my $next = 0;
while (@result < @data) {
    my $pick = pick($next);
    if ($pick >= 0) {
        push @result, $pick;
        $used{$pick} = 1;
        $next = 0;
    } elsif (@result == 0) {
        die "no valid solution found"
    } else {
        ## backtrack
        print ".";
        $next = pop( @result )+1;
        $used{$next-1} = 0;
    }
}
my @shuffled = map {join("\t", @{$data[$_]})} @result;
It uses backtracking when no valid next entry is found. (This is highly inefficient; reshuffling would probably be better.)
It relies on a sub pick, which returns the index of the next fitting entry if possible, or -1 otherwise.
sub pick {
    my $next_element = shift;
    foreach my $element ($next_element .. $#data) {
        next if $used{$element};
        my $type = $data[$element][2];
        if ( $data[$result[-1]][2] eq $type ) {
            if (@result > 3) {
                next
                    if ($type eq $data[$result[-2]][2] &&
                        $type eq $data[$result[-3]][2] &&
                        $type eq $data[$result[-4]][2] )
            }
        } else {
            if (@result > 1) {
                next
                    if ($data[$result[-1]][2] ne $data[$result[-2]][2]);
            }
        }
        return $element;
    }
    return -1;
}
In the sub pick, the check
if ( $data[$result[-1]][2] eq $type ) {
    if (@result > 3) {
        next
            if ($type eq $data[$result[-2]][2] &&
                $type eq $data[$result[-3]][2] &&
                $type eq $data[$result[-4]][2] )
    }
disallows 5 of the same type in a row. If you only want to disallow 6 of the same type in a row, change it to
if ( $data[$result[-1]][2] eq $type ) {
    if (@result > 4) {
        next
            if ($type eq $data[$result[-2]][2] &&
                $type eq $data[$result[-3]][2] &&
                $type eq $data[$result[-4]][2] &&
                $type eq $data[$result[-5]][2] )
    }
The code:
if (@result > 1) {
    next
        if ($data[$result[-1]][2] ne $data[$result[-2]][2]);
}
only allows a switch when the last two picks share a type, so a run has at least 2 entries of the same type. To require at least 3 before a switch, use
if (@result > 2) {
    next
        if ($data[$result[-1]][2] ne $data[$result[-2]][2]
            || $data[$result[-1]][2] ne $data[$result[-3]][2]);
}
For runs of 3 to 5, which is what the question asks for, the two changed variants are the ones to combine.
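Since it is easy to be off by one with these limits, it may help to check the finished list. The following is a small sketch, separate from the backtracking code above, that measures the run lengths of a category sequence; run_lengths and runs_ok are names invented for this example.

```perl
use strict;
use warnings;

# Lengths of consecutive runs of equal values, e.g. (a a b) -> (2, 1).
sub run_lengths {
    my @runs;
    my $prev;
    for my $cat (@_) {
        if ( defined $prev && $cat eq $prev ) { $runs[-1]++ }
        else                                  { push @runs, 1 }
        $prev = $cat;
    }
    return @runs;
}

# True if no run exceeds $max and every run except the last reaches $min.
sub runs_ok {
    my ( $min, $max, @cats ) = @_;
    my @runs = run_lengths(@cats);
    return 0 if grep { $_ > $max } @runs;
    pop @runs;    # the final run may be cut short by the end of the list
    return 0 if grep { $_ < $min } @runs;
    return 1;
}

my @cats = qw(Color Color Color Shape Shape Shape Shape Number);
print join( ',', run_lengths(@cats) ), "\n";          # prints 3,4,1
print runs_ok( 3, 5, @cats ) ? "ok\n" : "not ok\n";   # prints ok
```

For the real list, the categories could be pulled out of the shuffled entries with something like map { (split /\t/)[2] } @shuffled before calling runs_ok.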

Print out preceding lines after regex match

I am writing a sort of quiz program. I am using a .txt file as a test bank, but I can't figure out how to use regexes to match each question and print the possible answers on separate lines. I was originally only going to do true/false questions, so I didn't need to match anything else, and just matching "1" worked fine. Basically, I need the question on one line and the answers on the following lines. Here is an example of a question:
1.) some text
a.) solution
b.) solution
c.) solution
The code I had before:
while (<$test>) {
foreach my $line (split /\n/) {
my $match1 = "1";
if ($line =~ /$match1/) {
$question1 = $line;
print "$question1\n";
print "Answer: ";
$answer1 = <>;
chomp ($answer1);
if ( $answer1 =~ /(^a$)/i) {
$score1 = 20;
push @score, $score1;
}
}

I really couldn't get what you were getting at, so I wrote this sample program.
use 5.016;
use strict;
use warnings;
my ( @lines, @questions, $current_question );
sub prompt {
my ( $prompt ) = @_;
print $prompt, ' ';
STDOUT->flush;
my $val = <>;
return $val;
}
QUESTION:
while ( <DATA> ) {
if ( my ( $ans ) = m/^=(\w+)/ ) {
INPUT: {
say @lines;
last unless defined( my $answer = prompt( 'Your answer:' ));
say '';
my ( $response ) = $answer =~ /([a-z])\s*$/;
if ( not $response ) {
$answer =~ s/\s*$//; #/
say "Invalid response. '$answer' is not an answer!\n";
redo INPUT;
}
if ( $response eq $ans ) {
say 'You are right!';
}
elsif ( my $ansln = $current_question->{$response} ) {
if ( $response eq 'q' ) {
say 'Quitting...';
last QUESTION;
}
say <<"END_SAY";
You chose:\n$current_question->{$response}
The correct answer was:\n$current_question->{$ans}
END_SAY
}
else {
say "Invalid response. '$response' is not an answer!\n";
redo INPUT;
}
};
@lines = ();
prompt( 'Press enter to continue.' );
say '';
}
else {
if ( my ( $qn, $q ) = m/^\s*(\d+)\.\)\s+(.*\S)\s*$/ ) {
push @questions, $current_question = { question => $q };
}
else {
my ( $l, $a ) = m/^\s+([a-z])/;
$current_question->{$l} = ( m/(.*)/ )[0];
}
push @lines, $_;
}
}
__DATA__
1.) Perl is
a.) essential
b.) fun
c.) useful
=c
2.) This question is
a.) Number two
b.) A test to see how this format is parsed.
c.) Unneeded
=b

This is probably oversimplified.
It just reads in the test data and creates a structure.
You could use it to grade test takers' answers.
use strict;
use warnings;
use Data::Dumper;
$/ = undef;
my $testdata = <DATA>;
my %HashTest = ();
my $hchoices;
my $hqeustion;
my $is_question = 0;
while ( $testdata =~ /(^.*)\n/mg )
{
my $line = $1;
$line =~ s/^\s+|\s+$//g;
next if ( length( $line ) == 0);
if ( $line =~ /^(\d+)\s*\.\s*\)\s*(.*)/ )
{
$is_question = 1;
$HashTest{ $1 }{'question'} = $2;
$HashTest{ $1 }{'choices'} = {};
$HashTest{ $1 }{'answer'} = 'unknown';
$hqeustion = $HashTest{ $1 };
$hchoices = $HashTest{ $1 }{'choices'};
}
elsif ( $is_question && $line =~ /^\s*(answer)\s*:\s*([a-z])/ )
{
$hqeustion->{'answer'} = $2;
}
elsif ( $is_question && $line =~ /^\s*([a-z])\s*\.\s*\)\s*(.*)/ )
{
$hchoices->{ $1 } = $2;
}
}
print "\nQ & A summary\n-------------------------\n";
for my $qnum ( keys %HashTest )
{
print "Question $qnum: $HashTest{$qnum}{'question'}'\n";
my $ans_code = $HashTest{$qnum}{'answer'};
print "Answer: ($ans_code) $HashTest{$qnum}{'choices'}{$ans_code}\n\n";
}
print "---------------------------\n";
print Dumper(\%HashTest);
__DATA__
1.) What is the diameter of the earth?
a.) Half the distance to the sun
b.) Same as the moon
c.) 6,000 miles
answer: c
2.) Who is buried in Grants Tomb?
a.) Thomas Edison
b.) Grant, who else
c.) Jimi Hendrix
answer: b
Output:
Q & A summary
-------------------------
Question 1: What is the diameter of the earth?'
Answer: (c) 6,000 miles
Question 2: Who is buried in Grants Tomb?'
Answer: (b) Grant, who else
---------------------------
$VAR1 = {
'1' => {
'question' => 'What is the diameter of the earth?',
'answer' => 'c',
'choices' => {
'c' => '6,000 miles',
'a' => 'Half the distance to the sun',
'b' => 'Same as the moon'
}
},
'2' => {
'question' => 'Who is buried in Grants Tomb?',
'answer' => 'b',
'choices' => {
'c' => 'Jimi Hendrix',
'a' => 'Thomas Edison',
'b' => 'Grant, who else'
}
}
};

Perl substitute strings in hash

Hi, I have a Perl hash of arbitrary depth. I want to replace a string throughout the entire structure with something else.
What is the right approach?
I did something like this:
#convert the hash to string for manipulation
my $data = YAML::Dump(%hash);
# do manipulation
---
---
# time to get back the hash
%hash = YAML::Load($data);

Your idea seems very risky to me, since it can be hard to be sure that the substitution won't destroy something in the output of YAML::Dump that will prevent the result from being read back in again, or worse, something that will alter the structure of the hash as it is represented in the dump string. What if the manipulation you are trying to perform is to replace : with , or ’ with ', or something of that sort?
I would probably do something more like this:
use Scalar::Util 'reftype';
# replace $this with $that in key names and string values of $hash
# recursively apply replacement in hash and all its subhashes
sub hash_replace {
    my ($hash, $this, $that) = @_;
    for my $k (keys %$hash) {
        # substitution in value
        my $v = $hash->{$k};
        if (ref $v && reftype($v) eq "HASH") {
            hash_replace($v, $this, $that);
        } elsif (! ref $v) {
            # change the stored value, not the copy in $v
            $hash->{$k} =~ s/$this/$that/g;
        }
    }
    my $new_hash = {};
    for my $k (keys %$hash) {
        # substitution in key
        (my $new_key = $k) =~ s/$this/$that/g;
        $new_hash->{$new_key} = $hash->{$k};
    }
    %$hash = %$new_hash; # replace old keys with new keys
}
The s/…/…/ replacement I used here may not be appropriate for your task; you should feel free to use something else. For example, instead of strings $this and $that you might pass two functions, $key_change and $val_change which are applied to keys and to values, respectively, returning the modified versions. See the ###### lines below:
use Scalar::Util 'reftype';
# apply $key_change to key names and $val_change to string values of $hash
# recursively apply the changes in hash and all its subhashes
sub hash_replace {
    my ($hash, $key_change, $val_change) = @_;
    for my $k (keys %$hash) {
        # substitution in value
        my $v = $hash->{$k};
        if (ref $v && reftype($v) eq "HASH") {
            hash_replace($v, $key_change, $val_change);
        } elsif (! ref $v) {
            # store the result back into the hash, not just into $v
            $hash->{$k} = $val_change->($v); #######
        }
    }
    my $new_hash = {};
    for my $k (keys %$hash) {
        # substitution in key
        my $new_key = $key_change->($k); #######
        $new_hash->{$new_key} = $hash->{$k};
    }
    %$hash = %$new_hash;
}
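As a usage sketch for the callback version, the following self-contained script restates the sub compactly and runs it on a small nested hash. The sample data in %config and the old/new substitution are invented for illustration.

```perl
use strict;
use warnings;
use Scalar::Util 'reftype';

# Apply $key_change to every key and $val_change to every plain (non-ref)
# value, recursing into nested hashes. Other reference types are left alone.
sub hash_replace {
    my ( $hash, $key_change, $val_change ) = @_;
    for my $k ( keys %$hash ) {
        my $v = $hash->{$k};
        if ( ref $v && reftype($v) eq 'HASH' ) {
            hash_replace( $v, $key_change, $val_change );
        }
        elsif ( !ref $v ) {
            $hash->{$k} = $val_change->($v);
        }
    }
    my %renamed;
    for my $k ( keys %$hash ) {
        $renamed{ $key_change->($k) } = $hash->{$k};
    }
    %$hash = %renamed;
    return;
}

# Hypothetical sample data.
my %config = (
    host  => 'old.example.com',
    paths => { old_root => '/srv/old', log => '/var/log/old.log' },
);

hash_replace(
    \%config,
    sub { ( my $k = shift ) =~ s/old/new/g; $k },    # key callback
    sub { ( my $v = shift ) =~ s/old/new/g; $v },    # value callback
);

print "$config{host}\n";               # prints new.example.com
print "$config{paths}{new_root}\n";    # prints /srv/new
```

Note that if two different keys map to the same new key, one of them silently overwrites the other, so the key callback should be chosen with that in mind.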

Here's one way to attack it, by walking through the hash level by level. In this code, you pass in a sub that does whatever you like to each value in the nested hash. This code only modifies the values, not the keys, and it ignores other reference types (i.e. scalar refs, array refs) in the nested structure.
#!/usr/bin/perl -w
use Modern::Perl;
## Visit all nodes in a nested hash. Bare-bones.
sub visit_hash
{
    my ($start, $sub) = @_;
    my @q = ( $start );
    while (@q) {
        my $hash = pop @q;
        foreach my $key ( keys %{$hash} ) {
            my $ref = ref($hash->{$key});
            if ( $ref eq "" ) { # not a reference
                &$sub( $hash->{$key} );
                next;
            }
            if ( $ref eq "HASH" ) { # reference to a nested hash
                push @q, $hash->{$key};
                next;
            }
            # ignore other reference types.
        }
    }
}
The following gives an example of how to use it, replacing e with E in a nested hash:
# Example of replacing a string in all values:
my %hash =
(
a => "fred",
b => "barney",
c => "wilma",
d => "betty",
nest1 =>
{
1 => "red",
2 => "orange",
3 => "green"
},
nest2 =>
{
x => "alpha",
y => "beta",
z => "gamma"
},
);
use YAML::XS;
print "Before:\n";
print Dump( \%hash );
# now replace 'e' with 'E' in all values.
visit_hash( \%hash, sub { $_[0] =~ s/e/E/g; } );
print "After:\n";
print Dump( \%hash );

How to automagically create pattern based on real data?

I have many vendors in a database, and they all differ in some aspect of their data. I'd like to make a data validation rule that is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if a user inputs the string 'XZ-217' for vendor B, the algorithm should compare it with previous data and say: this string is not similar to vendor B's previous data.
Is there a good way or tool to achieve such a comparison? The answer could be a generic algorithm or a Perl module.
Edit:
The "similarity" is hard to define, I agree. But I'd like to find an algorithm that could analyze the previous 100 or so samples and then compare the outcome of that analysis with new data. Similarity may be based on length, on the use of characters/numbers, on string creation patterns, on a similar beginning/end/middle, or on the presence of separators.
I feel it is not an easy task, but on the other hand I think it has very wide use, so I hoped there would already be some hints.

You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
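To give an idea of what such a string metric computes, here is a minimal pure-Perl sketch of the textbook dynamic-programming Levenshtein distance. It is not the Text::Levenshtein module's interface; the sub name levenshtein is simply chosen for this example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance between two strings: the minimum number of single-character
# insertions, deletions and substitutions needed to turn $s into $t.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );    # distances from the empty prefix of $s
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,             # deletion
                $cur[ $j - 1 ] + 1,        # insertion
                $prev[ $j - 1 ] + $cost    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print levenshtein( 'kitten', 'sitting' ), "\n";    # prints 3
print levenshtein( 'XZ-217', 'XZ-23' ),   "\n";    # prints 2
```

A new value could then be compared against each stored sample for the vendor and flagged when even the smallest distance is large relative to its length; the threshold is a judgment call.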

Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
# Build one anchored regex from a list of profiles.
# A profile is a list of chunks: [ P => $text ], [ W => $min, $max ] or [ D => $min, $max ].
sub compile_search_expr {
    shift;
    @_ = @{ shift() } if @_ == 1;
    my $str
        = join( '|'
              , map { join( ''
                          , grep { defined; }
                            map {
                                  $_->[0] eq 'P' ? quotemeta( $_->[1] )
                                : $_->[0] eq 'W' ? "\\w{$_->[1],$_->[2]}"
                                : $_->[0] eq 'D' ? "\\d{$_->[1],$_->[2]}"
                                :                  undef
                                ;
                            } @$_
                          )
                    } @_
              );
    return qr/^(?:$str)$/;
}
# Widen an existing profile of the same shape, or add the new profile to the list.
sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) {
        my $profile_length = @$profile;
        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) {
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] )
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}
# Split a string into punctuation, alphanumeric and digit chunks.
sub compute_info_profile {
    shift;
    my @profile_chunks
        = map {
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length, length ]
            :        [ D => length, length ]
          }
          grep { length; } split /(\W+)/, shift
        ;
    # return a reference, which is what merge_profiles expects
    return \@profile_chunks;
}
# Pseudo-Perl: the $application methods, Incident, INVALID_FORMAT and
# $customer are placeholders for whatever the larger application provides.
sub process_input_task {
    my ( $application, $input ) = @_;
    my $patterns = $application->get_patterns_for_current_customer;
    my $regex = $application->compile_search_expr( $patterns );
    if ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        $application->merge_profiles( $patterns, $application->compute_info_profile( $input ));
    }
    else {
        $application->escalate(
            Incident->new( issue => INVALID_FORMAT
                         , input => $input
                         , customer => $customer
                         ));
    }
    return $application->process_approved_input( $input );
}

Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw'min max';
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
    A => [qw/ XZ-4 XZ-23 XZ-217 /],
    B => [qw/ 1276 1899 22711 /],
    C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
    print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
    my @cases = @_;
    my %exploded;
    # ex. $case may be XZ-217
    foreach my $case (@cases) {
        my @parts =
            grep { defined and length }
            split( /(\d+|\w+)/, $case );
        # @parts are ( XZ, -, 217 )
        foreach (@parts) {
            if (/\d/) {
                # 217 becomes ['\d' => 3]
                push @{ $exploded{$case} }, ['\d' => length];
            } elsif (/\w/) {
                # XZ becomes ['\w' => 2]
                push @{ $exploded{$case} }, ['\w' => length];
            } else {
                # - becomes ['lit' => '-']
                push @{ $exploded{$case} }, ['lit' => $_ ];
            }
        }
    }
    my $pattern = '';
    # iterate over nth element (part) of each case
    my $ea = each_arrayref(values %exploded);
    while (my @parts = $ea->()) {
        # remove undefined (i.e. optional) parts
        my @def_parts = grep { defined } @parts;
        # check that all (defined) parts are the same type
        my @part_types = uniq map {$_->[0]} @def_parts;
        if (@part_types > 1) {
            warn "Parts not aligned\n";
            return;
        }
        my $type = $part_types[0]; # same so make scalar
        # were there optional parts?
        my $required = (@parts == @def_parts);
        # keep the values of each part
        # these are either a repetition count or literal strings
        # (note: "sort uniq LIST" would treat uniq as the comparison sub)
        my @values = uniq map { $_->[1] } @def_parts;
        # write the specific pattern for each type
        if ($type eq '\d' or $type eq '\w') {
            # the values are lengths, so compare them numerically
            my $min = $required ? min(@values) : 0;
            my $max = max(@values);
            $pattern .= $type . "{$min,$max}";
        } elsif ($type eq 'lit') {
            # quote special characters, - becomes \-
            my @uniq = map { quotemeta } @values;
            # join with alternations, surround by non-capture group, add quantifier
            $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
        }
    }
    # build the qr regex from pattern
    my $regex = qr/$pattern/;
    # test that all original patterns match (@fail should be empty)
    my @fail = grep { $_ !~ $regex } @cases;
    if (@fail) {
        warn "Some cases fail for generated pattern $regex: (@fail)\n";
        return '';
    } else {
        return $regex;
    }
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
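Once a pattern exists per vendor, checking new input could look roughly like the sketch below. The three patterns are copied by hand from the output shown above; the %pattern hash and the looks_like_vendor sub are assumptions made for this example, and the match is anchored with \A and \z because the generated patterns themselves are not anchored.

```perl
use strict;
use warnings;

# Patterns copied from the generated output above, keyed by vendor.
my %pattern = (
    A => qr/\w{2,2}(?:\-){1}\d{1,3}/,
    B => qr/\d{4,5}/,
    C => qr/\d{2,2}(?:\-)?\d{0,2}/,
);

# Return 1 if $input matches the whole pattern for $vendor, 0 otherwise
# (including when the vendor has no pattern yet).
sub looks_like_vendor {
    my ( $vendor, $input ) = @_;
    my $re = $pattern{$vendor} or return 0;
    return $input =~ /\A$re\z/ ? 1 : 0;
}

print looks_like_vendor( 'B', 'XZ-217' ), "\n";    # prints 0
print looks_like_vendor( 'B', '1276' ),   "\n";    # prints 1
print looks_like_vendor( 'A', 'XZ-217' ), "\n";    # prints 1
```

Keep in mind that \w also matches digits, so a value such as 12-4 passes the pattern for vendor A as well; the patterns say which inputs are plausible for a vendor, not which vendor an input belongs to.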

If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.

Perl regex syntax generation

This is a follow up to the question posted here: Perl Regex syntax
The results from that discussion yielded this script:
#!/usr/bin/env perl
use strict;
use warnings;
my @lines = <DATA>;
my $current_label = '';
my @ordered_labels;
my %data;
for my $line (@lines) {
if ( $line =~ /^\/(.*)$/ ) { # starts with slash
$current_label = $1;
push @ordered_labels, $current_label;
next;
}
if ( length $current_label ) {
if ( $line =~ /^(\d) "(.*)"$/ ) {
$data{$current_label}{$1} = $2;
next;
}
}
}
for my $label ( @ordered_labels ) {
print "$label <- as.factor($label\n";
print " , levels= c(";
print join(',',map { $_ } sort keys %{$data{$label}} );
print ")\n";
print " , labels= c(";
print join(',',
map { '"' . $data{$label}{$_} . '"' }
sort keys %{$data{$label}} );
print ")\n";
print " )\n";
}
__DATA__
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
execute .
Essentially, I need to build the Perl to accommodate a syntax shorthand found in the SPSS file. For adjacent columns, SPSS allows one to type something like:
VALUE LABELS
/agree1 to agree5
1 "Strongly disagree"
2 "Disagree"
3 "Neutral"
4 "Agree"
5 "Strongly agree"
As the script currently exists, it will generate this:
agree1 to agree5 <- factor(agree1 to agree5
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
and I need it to produce something like this:
agree1 <- factor(agree1
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
agree2 <- factor(agree2
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
…

use strict;
use warnings;
main();
sub main {
    my @lines = <DATA>;
    my $vlabels = get_value_labels(@lines);
    write_output_delim($vlabels);
}
# Extract the value label information from SPSS syntax.
sub get_value_labels {
    my (@vlabels, $i, $j);
    for my $line (@_){
        if ( $line =~ /^\/(.+)/ ){
            my @vars = parse_var_range($1);
            $i = @vlabels;
            $j = $i + @vars - 1;
            push @vlabels, { var => $_, codes => [] } for @vars;
        }
        elsif ( $line =~ /^\s* (\d) \s+ "(.*)"$/x ){
            push @{$vlabels[$_]{codes}}, [$1, $2] for $i .. $j;
        }
    }
    return \@vlabels;
}
# A helper function to handle variable ranges: "agree1 to agree3".
sub parse_var_range {
    my $vr = shift;
    my @vars = split /\s+ to \s+/x, $vr;
    return $vr unless @vars > 1;
    my ($stem) = $vars[0] =~ /(.+?)\d+$/;
    my @n = map { /(\d+)$/ } @vars;
    return map { "$stem" . $_ } $n[0] .. $n[1];
}
sub write_output_delim {
    my $vlabels = shift;
    for my $vlab (@$vlabels){
        print $vlab->{var}, "\n";
        print join("\t", '', @$_), "\n" for @{$vlab->{codes}}
    }
}
sub write_output_factors {
    # You get the idea...
}
__DATA__
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
/agree1 to agree3
1 "Disagree"
2 "Neutral"
3 "Agree"
