I have a list with 79 entries that each look similar to this:
"YellowCircle1.png\tc\tColor"
That is, each entry has 3 elements (.png-file, a letter, and a category). The category can be color, number or shape.
I want to create a new list from this, pseudo-randomized. That is, I want to have all 79 entries in a random order, but with a limitation.
I have created a perl script for a completely random version using shuffle:
# !/usr/bin/perl
# Perl script to generate input list for E-Prime experiment
# with semi-randomized trials
# Date: 2020-12-30
# Open text file
$filename = 'output_shuffled.txt';
open($fh, '>', $filename) or die "Could not open file '$filename'";
# Generate headline
print $fh "Weight\tNested\tProcedure\tCardIMG1\tCardIMG3\tCardIMG4\tCardStim\tCorrectAnswer\tTrialType\n";
# Array with list of stimuli including corresponding correct response and trial type
#stimulus = (
"BlueCross1.png\tm\tColor",
"BlueCross2.png\tm\tColor",
"BlueStar1.png\tm\tColor",
"BlueStar3.png\tm\tColor",
"BlueTriangle2.png\tm\tColor",
"BlueTriangle3.png\tm\tColor",
"GreenCircle1.png\tv\tColor",
"GreenCircle3.png\tv\tColor",
"GreenCircle1.png\tv\tColor",
"GreenCircle3.png\tv\tColor",
"GreenCross1.png \tv\tColor",
"GreenCross4.png\tv\tColor",
"GreenTriangle3.png\tv\tColor",
"GreenTriangle4.png\tv\tColor",
"RedCircle2.png\tc\tColor",
"RedCircle3.png\tc\tColor",
"RedCross2.png\tc\tColor",
"RedCross4.png\tc\tColor",
"RedStar3.png\tc\tColor",
"RedStar4.png\tc\tColor",
"YellowCircle1.png\tn\tColor",
"YellowCircle2.png\tn\tColor",
"YellowStar1.png\tn\tColor",
"YellowTriangle2.png\tn\tColor",
"YellowTriangle4.png\tn\tColor",
"BlueCross1.png\tc\tNumber",
"BlueCross2.png\tv\tNumber",
"BlueStar1.png\tc\tNumber",
"BlueStar3.png\tn\tNumber",
"BlueTriangle2.png\tv\tNumber",
"GreenCircle1.png\tc\tNumber",
"GreenCircle3.png\tn\tNumber",
"BlueCross1.png\tm\tColor",
"BlueCross2.png\tm\tColor",
"BlueStar1.png\tm\tColor",
"BlueStar3.png\tm\tColor",
"BlueTriangle2.png\tv\tNumber",
"BlueTriangle3.png\tn\tNumber",
"GreenCircle1.png\tc\tNumber",
"GreenCircle3.png\tn\tNumber",
"GreenCross1.png\tc\tColor",
"GreenCross4.png\tm\tColor",
"GreenTriangle3.png\tn\tColor",
"GreenTriangle4.png\tm\tColor",
"RedCircle2.png\tv\tNumber",
"RedCircle3.png\tn\tNumber",
"RedCross2.png\tv\tNumber",
"RedCross4.png\tm\tNumber",
"RedStar3.png\tn\tColor",
"RedStar4.png\tm\tColor",
"YellowCircle1.png\tc\tColor",
"YellowCircle2.png\tv\tColor",
"YellowStar1.png\tc\tNumber",
"YellowStar4.png\tm\tNumber",
"YellowTriangle2.png\tv\tNumber",
"YellowTriangle4.png\tm\tNumber",
"BlueCross1.png\tn\tShape",
"BlueCross2.png\tn\tShape",
"BlueStar1.png\tv\tShape",
"BlueStar3.png\tv\tShape",
"BlueTriangle2.png\tc\tShape",
"BlueTriangle3.png\tc\tShape",
"GreenCircle1.png\tm\tShape",
"GreenCircle3.png\tm Shape",
"GreenCross1.png\tn\tShape",
"GreenCross4.png\tn\tShape",
"GreenTriangle3.png\tc\tShape",
"GreenTriangle4.png\tc\tShape",
"RedCircle2.png\tm\tShape",
"RedCircle3.png\tm\tShape",
"RedCross2.png\tn\tShape",
"RedCross4.png\tn\tShape",
"RedStar3.png\tv\tShape",
"RedStar4.png\tv\tShape",
"YellowCircle1.png\tm\tShape",
"YellowCircle2.png\tm\tShape",
"YellowStar1.png\tv\tShape",
"YellowStar4.png\tv\tShape",
"YellowTriangle2.png\tc\tShape",
"YellowTriangle4.png\tc\tShape",
);
# Shuffle --> Pick at random without double entries
use List::Util 'shuffle';
#shuffled = shuffle(#stimulus);
# Print each line with fixed values and shuffled stimulus entries to file
print $fh "1\t" . "\t" . "TrialProc\t" . "RedTriangle1.png\t" . "Greenstar2.png\t" . "YellowCross3.png\t" . "BlueCircle4.png\t" . "\t$_\n" for #shuffled;
# Close text file
close($fh);
# Print to terminal
print "Done\n";
However, what I eventually want is that the category does not switch more than once successively, but every 3 up to 5 times (randomly between these numbers). For example, if one line ends with "shape" and the following line with "color", the next line would have to be "color", because otherwise there would be 2 switches successively.
How would I create this? I suspect I would have to change the entries to something like hashes, so that I can create if-constructions based on the last element (that is "category") of each entry?
The solution - as you already guessed - is to split the data and reshuffle the parts that dont fit with your rules.
Here is the code that does that.
# Shuffle --> Pick at random without double entries
use List::Util 'shuffle';
my #data = shuffle(map {[split("\t")]} #stimulus);
my #result, %used;
my $next = 0;
while (#result < #data) {
my $pick = pick($next);
if ($pick >= 0) {
push #result, $pick;
$used{$pick} = 1;
$next = 0;
} elsif (#result == 0) {
die "no valid solution found"
} else {
## backtrack
print ".";
$next = pop( #result )+1;
$used{$next-1} = 0;
}
}
my #shuffled = map {join("\t", #{$data[$_]})} #result;
using backtracking if no solution is found. (This is highly inefficient - a reshuffling would probably be better)
It uses a sub pick which returns the index of a next fitting entry. (If possibe)
sub pick {
my $next_element = shift;
foreach my $element ($next_element .. $#data) {
next if $used {$element};
my $type = $data[$element][2];
if( $data[$result[-1]][2] eq $type ){
if (#result >3) {
next
if ($type eq $data[$result[-2]][2] &&
$type eq $data[$result[-3]][2] &&
$type eq $data[$result[-4]][2] )
}
} else {
if (#result >1) {
next
if ($data[$result[-1]][2] ne $data[$result[-2]][2]);
}
}
return $element;
}
return -1;
}
In the sub pick
if( $data[$result[-1]][2] eq $type ){
if (#result >3) {
next
if ($type eq $data[$result[-2]][2] &&
$type eq $data[$result[-3]][2] &&
$type eq $data[$result[-4]][2] )
}
disallows 5 times the same type in a row. If you want only to dissalow 6 times the same type you have to change it to
if( $data[$result[-1]][2] eq $type ){
if (#result >4) {
next
if ($type eq $data[$result[-2]][2] &&
$type eq $data[$result[-3]][2] &&
$type eq $data[$result[-4]][2] &&
$type eq $data[$result[-5]][2] )
}
The code:
if (#result >1) {
next
if ($data[$result[-1]][2] ne $data[$result[-2]][2]);
}
enforces 3 times (at least) the same type. If you want to change this to 4 times you have to use
if (#result >2) {
next
if ($data[$result[-1]][2] ne $data[$result[-2]][2]
|| $data[$result[-1]][2] ne $data[$result[-3]][2]);
}
Related
#file1 contains only startpoint-endpoint pair, each indices represent each pair. file2 is a text file, for #file2 each indices represents each line. I am trying to search each pair from #file1 in #file2 line by line. When the exact match is found, I would then try to extract information1 from file2 and print it out. But for now, I am trying to search for the matched pair in file2. The format of the matching pattern is as below:
Match case
From $file1[0]
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
match if file2 contains:
Line with other stuff
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff
Unmatch Case:
From file1:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
From file2:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:
For text files2, I stored it in #file2. For files1, I have successfully extracted and stored every Startpoint and the next line Endpoint as the format above in #file1. (No problem in extracting and storing each pair, so I wont be showing the code for this, it took around 4mins here) Then I split each element of #address, which are the startpoint and endpoint. Checking line by line in files2, if startpoint match, then I will move on next line to check endpoint, it is only considered match if the next line after Startpoint match the Endpoint, else try to search again until the end line of files2. This script does the job but it took 3 and a half hours to complete(there are around 60k pairs from file1 and 800k lines to check in file2). Is there any other efficient way to do this?
I am new in Perl scripting, I apologize for any silly mistakes, both in my explanation and my coding.
Here's the codes:
#!usr/bin/perl
use warnings;
my $report = '/home/dir/file2';
open ( $DATA,$report ) || die "Error when opening";
chomp (#file2 = <$DATA>);
#No problem in extracting Start-Endpoint pair from file1 into #file1, so I wont include
#the code for this
$size = scalar#file1;
$size2 = scalar#file2;
for ( $total=0; $total<$size; $total++ ) {
my #file1_split = split('\n',$file1[$total]);
chomp #file1_split;
my $match_endpoint = 0;
my $split = 0;
LABEL2: for ( $count=0; $count<$size2; $count++ ) {
if ( $match_endpoint == 1) {
if ( grep { $_ eq "file1_split[$split]" } $file2[$count] )
print"Pair($total):Match Pair\n";
last LABEL2; #move on to check next start-endpoint
#pair
}
else {
$split = 0; #reset back to check the same startpoint
and continue searching until match found or end line of file2
$match_endpoint = 0;
}
}
elsif ( grep { $_ eq "$address_array[$split]"} $array[$count] )
{
$match_endpoint = 1;#enable search for endpoint in next line
$split = 1; #move on next line to match endpoint
next;
}
elsif ( $count==$size2-1 ) {
print"no matching found for Path($total)\n";
}
}
}
If I'm understanding what your code is trying to do,
it looks like it would be more efficient to do it this way:
my %split=#file1;
my %total;
#total{#file1}=(0..$#file1);
my $split;
for( #file2 ){
if( $split ){
if( $_ eq $split ){
print"Pair($total{$split}):Match Pair\n";
}else{
$split{$split}="";
}
}
$split=$split{$_};
delete $split{$_};
}
for( keys %split ){
print"no matching found for Path($total{$_})\n";
}
If I have understood your spec (show matches), I'm betting this will complete in less than 5 seconds, unless you're using an old Dell D333. To further minimize the response time, you would write some extra code to drive the while loop by the hash with the fewest keys (you implied file1). If you use references to hashes, then you can write a small if-else statement to swap the hash references without having to code duplicate while statements.
use strict;
use warnings;
sub makeHash($) {
my ($filename) = #_;
open(DATA, $filename) || die;
my %result;
my ($start, $line);
while (<DATA>) {
if ($_ =~ /^Startpoint: (.*)/) {
$start = $1; # captured group in regular expression
$line = $.; # current line number
} elsif ($_ =~ /^Endpoint: (.*)/) {
my $end = $1;
if (defined $line && $. == ($line + 1)) {
my $key = "$start::$end";
# can distinguish start and end lines if necessary
$result{$key} = {start=>$start, end=>$end, line=>$line};
}
}
}
close(DATA);
return %result;
}
my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
my $f2h = $file2{$key};
if (defined $f2h) {
# You have access to hash members start and end if you need to distinguish further
printf $fmt, $f1h->{line}, $f2h->{line}, $key;
$nmatches++;
}
}
print "Found $nmatches matches\n";
Below, is my test data generator(thanks). I generated a worst-case scenario of 1,000,000 matches between two equal files. The matching code above finished on my MBP in under 20 seconds using the generated test data.
use strict;
use warnings;
sub rndStr { join'', #_[ map{ rand #_ } 1 .. shift ] }
open(F1, ">file1") || die;
open(F2, ">file2") || die;
for (1..1000000) {
my $start = rndStr(30, 'A'..'Z');
my $end = rndStr(30, 'A'..'Z');
print F1 "Startpoint: $start\n";
print F1 "Endpoint: $end\n";
print F2 "Startpoint: $start\n";
print F2 "Endpoint: $end\n";
}
close(F1);
close(F2);
I am trying to read a log file and write all the error logs to a new file. I must also keep track of how many errors there are and the number of messages in general. I must assume that the logs will be broken up onto multiple lines, so I have been using regex and series a variables to search for all possibilities and write to the appropriate file.
My file handles are: FILE, ERRORFILE, and SUCCESSFILE.
use strict;
use warnings;
my $totalcount = 0;
my $errorcount = 0;
my $log = "s"; # $log controls what what should be written where,
# incase it doesn't start with code.
# "s" = SuccessFile, "e" = ErrorFile
my $logStart = "y"; # used with m/^I/ or m/^E/ instead of full code
# incase the code is broken into pieces.
my $dash = 0;
while (<FILE>) {
$dash += () = $_ =~ m/-/g; # can't use tr/// because it counts at compile
if ( $dash lt 25 ) { next; } # this line skips "---Begin <Repository>---"
elsif ( m/[a-zA-Z <>]/ && $dash lt 25 ) { next; }
elsif ( $dash >= 26 ) { last; } #Ends loop at "---End <Repository>---"
if ( m/^I/ && $logStart eq "y" ) {
$log = "s";
$logStart = "n";
$totalcount++;
next;
} #Ignores nonerror logs
elsif ( m/^E/ && $logStart eq "y" ) {
chomp $_;
print ERRORFILE "$_";
$errorcount++;
$totalcount++;
$log = "e";
$logStart = "n";
}
elsif (m/ \.\n$/) { #End of log
if ( $log eq "s" ) { $logStart = "y"; next; }
print ERRORFILE "$_\n" if $log eq "e";
$logStart = "y";
}
else { #line doesn't start with code or end in " .\n"
chomp $_;
print ERRORFILE "$_" if $log eq "e";
next if $log eq "s";
}
}
print "\nThere are $errorcount error logs.\n";
print "There are $totalcount logs in the full log file.\n";
I know that the non-error logs start with I00020036 and the errors start with E03020039. Both end in " .\n"
---------- Begin <Load Repository> ---------------
I00020036: Loaded C:\Documents and Settings\dorja03\Desktop\DSMProduct\external\etpki\Linux_2.4_x86\redistrib\readme.txt into \DSM R11\external\etpki\Linux_2.4_x86\redistrib\readme.txt .
E03020039: Unable to load C:\Documents and Settings\dorja03\Desktop\DSMProduct\external\etpki\Linux_2.4_x86\redistrib\etpki_install_lib.sh into \DSM R11\external\etpki\Linux_2.4_x86\redistrib\etpki_install_lib.sh . Text file contains invalid characters .
---------- End <Load Repository> ---------------
I have been running a test sample with two lines. If the error comes up first, it will print it to the error file, along with the non-error log, and on the same line. If the non-error goes first, it doesn't recognize the error.
Is this because I'm using m// wrong or something else entirely?
Edit: Test input has been added. I also added the code to skip the header and footer.
Test output: If the non-error comes first, there are 0 errors and 1 log total.
If the non-error comes first, there is 1 error and 1 log total.
If this worked, it should have said there was 1 error and 2 logs. It also would have only printed the error to the ERRORFILE.
This won't answer why your code isn't working, but here's how I would approach the problem:
Since the logs can span over multiple lines, modify the default line-by-line behavior by tweaking $/.
Use appropriate data structures to filter the errors from non-errors. This will also allow you to defer printing till later.
The code would then look something like this:
use strict;
use warnings;
my %logs;
local $/ = " .\n";
while ( <> ) { # Now $_ is the full (multi-line) log
next if /--- Begin/; # Skip if /Begin/
last if /--- End/; # Stop processing if /End/
if ( m/^I/ ) {
push #{ $logs{nonerror} }, $_;
}
if ( m/^E/ ) {
push #{ $logs{error} }, $_;
}
}
printf "There are %d error logs\n.", scalar #{ $logs{error} // [] } ;
printf "There are %d logs in the full logfile.\n",
#{$logs{error} // []} + #{$logs{nonerror} // []};
Things I like about this approach:
Perl takes care of deciding when each log message ends (eliminates the $logStart variable altogether).
The logic is much easier to extend.
The while loop is dedicated to processing the log file (no need to ++ anything).
Use of sensibly-labeled data structures instead of temporary variables makes for easier code maintenance.
To make a formal answer, I scrapped this code and replaced it. I instead fed the file into var with a delimiter, then just split it into an array. It was much easier and cleaner. I don't however have the code anymore due to a lost flashdrive.
Just like the title laboriously explains, I'm looking for a way to pass matches from individual elements in an existing array into uninitialized variables stored in an existing array, i.e. consider the following:
my $data1;
my $data2;
my $data3;
my #data_pod(
$data1,
$data2,
$data3
);
my #array ( "a1", "b1", "c1" );
foreach (#array)
{
#GET RID OF THE 1'S BY A PATTERN MATCH
#STORE THE RESULT IN APPROPRIATE ELEMENT OF date_pod
}
The end result should make this
foreach (#data_pod)
{
print "$_\r\n";
}
produce this
a
b
c
Just assume that the match only matches singleton lowercase alphabetic characters. Of course, this is a metaphor to get at what I'm really after. I'm thinking along the lines of something akin to the following:
foreach my $cat (#quarters)
{
push #slim_quarters, $cat =~ /[A-Z][a-z]*, [A-Z][a-z]* \d*/g;
}
n.b. - I'm new to Perl. I'm a baby.
use Date::Calc qw( Today Date_to_Days Now );
# MIGHT BE ABLE TO USE THIS / MODIFY
# Lower Limit
my $year1 =;
my $month1 = 7;
my $day1 = $slim_quarters[0];
my $hour1 = 00;
my $min1 = 00;
my $minlower = ($hour1 * 60) + $min1;
# Upper Limit
my $year2 = 2001;
my $month2 = 7;
my $day2 = 6;
my $hour2 = 20;
my $min2 = 59;
my $minupper = ($hour2 * 60) + $min2;
# Current System Time
my $now = localtime();
print $now, "\n";
# Get current time from module
($year, $month, $day) = Today();
($hour, $min,) = Now();
my $minnow = ($hour * 60) + $min;
print "It is now $hour:$min or $minnow minutes since midnight\n";
$lower = Date_to_Days($year1, $month1, $day1);
$upper = Date_to_Days($year2, $month2, $day2);
$date = Date_to_Days($year, $month, $day);
print "$lower=lower\t$upper=upper\t$date=date\n";
if (($date >= $lower) && ($date <= $upper)) {
if (($date != $lower) && ($date != $upper)) {
print "Not on a start and stop day\n";
}
elsif (($date == $lower) && ($date == $upper)) {
print "same start and stop date\n";
if (($minnow >= $minlower) && ($minnow < $minupper)) {
print "match on dates and mins within range\n";
}
else {
print "BUT not within minute range\n";
exit();
}
}
elsif (($date == $lower) && ($minnow < $minlower)) {
print "before start time\n";
exit();
}
elsif (($date == $upper) && ($minnow > $minupper)) {
print "after end time\n";
exit();
}
print "GOOD\n";
}
else {
print "out of range\n";
}
This is a classic XY problem. I'll attempt to answer the question you should have asked instead of the question you actually asked.
Based on your comments, the crux of your problem seems to be parsing and comparing dates. Time::Piece, a core module since Perl 5.10, can handle both of these tasks:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Time::Piece;
my $format = '%A, %B %d';
# Parse
my $t1 = Time::Piece->strptime('Monday, June 13', $format);
my $t2 = Time::Piece->strptime('Friday, June 17', $format);
# Compare
say 't1 > t2' if $t1 > $t2;
say 't2 > t1' if $t2 > $t1;
# Print
say $_->strftime('%B %d') for $t1, $t2;
Output:
t2 > t1
June 13
June 17
Well I think you may have started with the wrong design, but this will work for you. If you tell us more about the steps you took to conclude that this is what you want then I am sure we can help.
use strict;
use warnings;
use feature 'say';
my #data_pod = \my ($data1, $data2, $data3);
my #array = qw( a1 b1 c1 );
for my $i (0 .. $#array) {
${ $data_pod[$i] } = $array[$i] =~ s/1//r;
}
say for $data1, $data2, $data3;
output
a
b
c
I have many vendors in database, they all differ in some aspect of their data. I'd like to make data validation rule which is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if user inputs string 'XZ-217' for vendor B, algorithm should compare previous data and say: this string is not similar to vendor B previous data.
Is there some good way/tools to achieve such comparison? Answer could be some generic algoritm or Perl module.
Edit:
The "similarity" is hard to define, i agree. But i'd like to catch to algorithm, which could analyze previous ca 100 samples and then compare the outcome of analyze with new data. Similarity may based on length, on use of characters/numbers, string creation patterns, similar beginning/end/middle, having some separators in.
I feel it is not easy task, but on other hand, i think it has very wide use. So i hoped, there is already some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
#_ = #{ shift() } if #_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} #$_
)
} #_ == 1 ? #{ shift } : #_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = #_;
my $found = 0;
PROFILE:
for my $profile ( #$profile_list ) {
my $profile_length = #$profile;
# it's not the same profile.
next PROFILE unless $profile_length == #$new_profile;
my #merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( #merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
#$profile = #merged;
$found = 1;
last PROFILE;
}
push #$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = #_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(#{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my #cases = #_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (#cases) {
my #parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# #parts are ( XZ, -, 217 )
foreach (#parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push #{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push #{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push #{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my #parts = $ea->()) {
# remove undefined (i.e. optional) parts
my #def_parts = grep { defined } #parts;
# check that all (defined) parts are the same type
my #part_types = uniq map {$_->[0]} #def_parts;
if (#part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (#parts == #def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my #values = sort uniq map { $_->[1] } #def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my #uniq = map { quotemeta } uniq #values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', #uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (#fail should be empty)
my #fail = grep { $_ !~ $regex } #cases;
if (#fail) {
warn "Some cases fail for generated pattern $regex: (#fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.
I have a problem to count the majority selectedresult for each pair of string. my code: it seems just count if user choose either sysA, sysB or both without considering the pair of string. I'm also have a problem to make multiple comparision and deal with 7 users for each pair.
( $file = <INFILE> ) {
#field = parse_csv($file);
chomp(#field);
#query = $field[1];
for($i=0;$i<#query;++$i) {
if ( ($field[2] eq $method) || ($field[3] eq $method)){
if ( $field[4] eq $field[2]) {
print "$query[$i]: $field[2], $field[3], $field[4]\n";
$counta++;
}
if ( $field[4] eq $field[3]) {
print "$query[$i]: $field[2], $field[3]: $field[4]\n";
$countb++;
}
if ( $field[4] eq ($field[2] && $field[3])) {
#print "$query[$i]: $field[2]$field[3]\n";
$countc++;
}
data: for each query, i have 3 different combination of string comparision.
comparison("lucene-std-rel","lucene-noLen-rr");
comparison("lucene-noLen-rr","lucene-std-rel");
comparison("lucene-noLen-rr","random");
comparison( "random", "lucene-noLen-rr");
comparison("lucene-noLen-rr","lucene-nolen-rel");
Comparison("lucene-nolen-rel","lucene-noLen-rr");
example data for one pair (7 users evaluate for each pair):
user1,male,lucene-std-rel,random,lucene-std-relrandom
user2,male,lucene-std-rel,random,lucene-std-rel
user3,male,lucene-std-rel,random,lucene-std-rel
user4,male,lucene-std-rel,random,lucene-std-rel
user5,male,lucene-std-rel,random,lucene-std-relrandom
user6,male,lucene-std-rel,random,lucene-std-rel
user7,male,lucene-std-rel,random,lucene-std-rel
example output required: query 1:male fitness models
lucene-std-rel:5, random:0, both:2 ---> majority:lucene-std-rel
any help is very much appreciated.
Well, without making this more complex than you requested, here is what I came up with as a possible approach.
#!/usr/bin/perl
use strict;
my %counter = ( "A" => 0, "B" => 0, "AB" => 0, "majority" => 0);
while(<DATA>){
chomp;
next unless $_;
my ($workerId,$query,$sys1,$sys2,$resultSelected) = split(',');
$counter{$resultSelected}++;
}
$counter{'majority'} = (sort {$counter{$b} <=> $counter{$a}} keys %counter)[0];
print "A: $counter{'A'} B: $counter{'B'} both(AB): $counter{'AB'} majority: $counter{'majority'}\n";
__END__
user1,male,A,B,A
user2,male,A,B,AB
user3,male,A,B,B
user4,male,A,B,A
user5,male,A,B,A
The output of this is:
A: 3 B: 1 both(AB): 1 majority: A
I don't feel like my example to you fully addresses the idea of there being more than one type with the "majority". For instance, if both A and B are 9, I'd expect them both to be listed there. I didn't bother to do that since you didn't ask, but hopefully this will get you along the right path.
open( INFILE, "compare.csv" ) or die("Can not open input file: $!");
while ( $file = <INFILE> ) {
#field = parse_csv($file);
chomp(#field);
#query = $field[1];
for($i=0;$i<#query;++$i) {
if ( ($field[2] eq $method) || ($field[3] eq $method)){
if ( $field[4] eq $field[2]) {
print "$query[$i]: $field[2], $field[3], $field[4]\n";
$counta++;
}
if ( $field[4] eq $field[3]) {
print "$query[$i]: $field[2], $field[3]: $field[4]\n";
$countb++;
}
if ( $field[4] eq ($field[2] && $field[3])) {
#print "$query[$i]: $field[2]$field[3]\n";
$countc++;
}
}
}
sub parse_csv {
my $text = shift;
my #new = ();
push( #new, $+ ) while $text =~ m{
"([^\"\](?:\.[^\"\])*)",?
| ([^,]+),?
| ,
}gx;
push( #new, undef ) if substr( $text, -1, 1 ) eq ',';
return #new;
}`