Perl - Using regex to match input in hash key or value - regex

First, this is a homework assignment. I am having a tough time with regex, and I'm stuck.
This is the code I have so far, where I have the user designate a filename, and if it exists, populates a hash of the names as keys, and the phone numbers as the values.
#!/usr/bin/perl
use strict;
print "\nEnter Filename: ";
my $file = <STDIN>;
chomp $file;
if(!open(my $fileName, "<", "$file"))
{
print "Sorry, that file doesn't exist!", "\n";
}
else
{
my %phoneNums;
while (my $line=<$fileName>)
{
chomp($line);
(my $name,my $number) = split /:/, $line;
$phoneNums{$name} = $number;
}
print "Read in the file!", "\n\n";
print "Enter search: ";
my $input = <STDIN>;
chomp $input;
#HERE IS WHERE I'M LOST
}
print "\n";
This is the part I am stuck on:
Allow the user to enter a search string.
Look for matches using the same style as the phone. Any individual
character in the search string can match any other character from the
key, meaning a ‘2’ in the search string can match a ‘2’, ‘A’, ‘B’, or ‘C’ in the contact list. Matches can occur in the contact name or the phone number. For a match to occur, each character in the search string must appear, in order, in the contact info, but not necessarily next to each
other. For example, a search string of “86” (essentially the same as a search string of “TM” or “NU”) would match “TOM” but not “MOTHER”.
Characters on each phone keys:
0,
1,
2ABC,
3DEF,
4GHI,
5JKL,
6MNO,
7PQRS,
8TUV,
9WXYZ
I just am stuck on how exactly to make all those character classes, and any help at all is much appreciated.

The way to tackle this is by writing a function that reduces your 'things' to their common components. The best way to do this IMO is use a hash:
my %num_to_letter = (
0 => [],
1 => [],
2 => [ "A", "B", "C" ],
3 => [ "D", "E", "F" ],
4 => [ "G", "H", "I" ],
5 => [ "J", "K", "L" ],
## etc.
);
my %letter_to_num;
foreach my $key ( keys %num_to_letter ) {
foreach my $element ( #{$num_to_letter{$key}} ) {
$letter_to_num{lc($element)} = lc($key);
}
}
print Dumper \%letter_to_num;
This creates a map of which letters or numbers map to their original - a bit like this:
$VAR1 = {
'b' => '2',
'g' => '4',
'e' => '3',
'i' => '4',
'a' => '2',
'j' => '5',
...
Note - you can do this by hand, but I prefer to generate from the top map, because I think it looks neater. Note - we use lc to lower case everything, so this becomes case insensitive. It's probably worth looking at fc - which is a similar tool but handles international characters. (Not relevant in this example though)
You then 'reduce' both search and 'target' to their common values:
sub normalise {
my ( $input ) = #_;
#join with no delimiter.
return join ( '',
#look up $_ (each letter) in $letter_to_num
#if not present, use // operator to return original value.
#this means we get to turn numbers into letters,
#but leave things that are already numbers untouched.
map { $letter_to_num{lc($_)} // $_ }
#split the input line into characters.
split ( //, $input )
);
}
print normalise ("DAD"),"\n"; ## 323
And then compare one against the other:
my $search = "DAD";
my $normalised_search = normalise($search);
print "Searching for: \"$normalised_search\"\n";
my $number_to_match = '00533932388';
my $string_to_match = "daddyo";
print "Matches number\n"
if normalise($number_to_match) =~ m/$normalised_search/;
print "Matches string\n"
if normalise($string_to_match) =~ m/$normalised_search/;

Here's an almost procedural approach that cheats a bit by using Hash::MultiValue:
use Hash::MultiValue; # makes reversing and flattening easier
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hash::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# TOM in numbers - 866 in letters
my $letters = "TOM" ;
print join '', $let2num{$_} // $_ for (split('', $letters)), "\n";
my $phone_input = "866" ;
print join '', #{$num2let{$_}}," " for (split('', $phone_input)) , "\n";
Output:
866
TUV MNO MNO
So here "TOM" would overlap with "UNO" ... I like #Sobrique's answer :-)
To search an array/list of contact names using the phone keypad input we can create a hash containing keys and values of the names and their number equivalents and then match the "converted" name value against the input:
use Hash::MultiValue; # makes reversing and flattening easier
my #contacts = <DATA> ;
chomp #contacts;
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hasj::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# create key/value pairs for contact database
my %contacts2nums ;
for $contact (#contacts) {
$contacts2nums{$contact} = join "",
map { $let2num{$_} } split('', uc $contact);
}
my $phone_input = "866";
for my $contact (keys %contacts2nums) {
print "The text: \"$contact\" matches the input: \"$phone_input\" \n"
if $phone_input eq $contacts2nums{$contact};
}
__DATA__
Tom
Mother
TIMTOWDI
DAD
Gordon
Output:
The text: "Tom" matches the input: "866"
A more organized approach would wrap the conversion operation in a function.
Addendum:
With a real keypad you could probably come up with a simple algorithm that could be more deterministic regarding the letter you want to associate with the number on the keypad. You could iterate through the array based on number of presses of the key: e.g. two presses on "2" would be equal to "B", etc. You'd just have to figure out how/when to move to the next character with some kind of timeout/wait value. This way you would have a more exact string on which to base your search.

Related

Perl remove characters from string matrix according to pattern

I have multiple strings with the same length stored in a hash data structure. Example:
$VAR1 = {
'first' => 'abcXY',
'second' => 'XYXYa',
'third' => '*abXZ'
};
From this 'matrix' of characters, I would like to remove the 'columns' which exclusively contain the characters X or Y. In the above example, this would be the fourth character of each string (4th 'column').
The desired result would be:
$VAR1 = {
'first' => 'abcY',
'second' => 'XYXa',
'third' => '*abZ'
};
Following code does this by creating a transpose of the values of my hash structure and then determines which indices to keep:
# data structure
my %h = ('first'=>'abcXY', 'second'=>'XYXYa', 'third'=>'*abXZ' );
# get length of all values in hash
my $nchar = length $h{(keys(%h))[0]};
# transpose values of hash
my #transposed = map { my $idx=$_; [map {substr ($_, $idx, 1) } values(%h)] } 0..$nchar-1;
# determine indices which I want to keep
my #indices;
for my $i (0..$#transposed){
my #a = #{$transposed[$i]};
# do not keep index if column consists of X and Y
if ( scalar(grep {/X|Y/} #a) < scalar(#a) ) {
push #indices, $i;
}
}
# only keep letters with indices
for my $k (keys %h){
my $str = $h{$k};
my $reduced = join "", map{ substr ($str, $_, 1) } #indices;
$h{$k} = $reduced;
}
This is a terrible amount of code for such a simple operation. How could I do this more elegantly (preferably not with some matrix library, but with standard perl)?
Edit
Here another example: From the following strings, the first and last characters should be removed, because in both strings, the first and last position is either X or Y:
$VAR1 = {
'1' => 'Xsome_strX',
'2' => 'YsomeXstrY'
};
Desired result:
$VAR1 = {
'1' => 'some_str',
'2' => 'someXstr'
};
my $total = values %hash;
my %ind;
for my $v (values %hash) {
$ind{ pos($v) -1 }++ while $v =~ /[XY]/g;
}
my #to_remove = sort {$b <=> $a} grep { $ind{$_} == $total } keys %ind;
for my $v (values %hash) {
substr($v, $_, 1, "") for #to_remove;
}

Search for String in File Using Regular Expression Perl

I'm new to Perl, I'm reading text from a file and want to REPLACE some words with their translation in French. I managed to get word by word, but not by expression/string, I'm having problems getting it code wise.
Code for word by word:
my $filename = 'assign3.txt';
my #lexicon_en = ("Winter","Date", "Due Date", "Problem", "Summer","Mark","Fall","Assignment","November");
my #lexicon_fr = ("Hiver", "Date", "Date de Remise","Problème","Été", "Point", "Automne", "Devoir", "Novembre");
my $i=1;
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file $filename !";
while (<$fh>) {
for my $word (split)
{
print " $i. $word \n";
$i++;
for (my $j=0; $j < 9;$j++){
if ($word eq $lexicon_en[$j]){
print "Found one! - j value is $j\n";
}
}
}
}
print "\ndone here!!\n";
Here is the regular expression I'm trying to use:
/\w+\s\w+/
This is my code for strings:
while (<>) {
print ("this is text: $_ \n");
if ((split (/Due\sDate/),$_) eq "Due Date"){
print "yes!!\n";
}
}
I think I understand the challenge you're having. Because "due date" is two words you need it to match before "due" matches otherwise you get several incorrect translations. One way to deal with that would be to order your matches by the largest number of words to the fewest so that "due date" is dealt with before "due".
If you convert your arrays to a hash (dictionary) you can order the keys based on the number of spaces and then iterate over them to do the actual substitutions:
#!/usr/bin/perl
use strict;
use warnings;
#my #lexicon_en = ("Winter","Date", "Due Date", "Problem", "Summer","Mark","Fall","Assignment","November");
#my #lexicon_fr = ("Hiver", "Date", "Date de Remise","Problème","Été", "Point", "Automne", "Devoir", "Novembre");
# convert your arrays to a hash
my %lexicon = (
'Winter' => 'Hiver',
'Date' => 'Date',
'Due Date' => 'Date de Remise',
'Problem' => 'Problème',
'Summer' => 'Été',
'Mark' => 'Point',
'Fall' => 'Automne',
'Assignment' => 'Devoir',
'November' => 'Novembre',
);
# sort the keys on the number of spaces found
my #ordered_keys = sort { ($a =~ / /g) < ($b =~ / /g) } keys %lexicon;
my $sample = 'The due date of the assignment is a date in the fall.';
print "sample before: $sample\n";
foreach my $key (#ordered_keys) {
$sample =~ s/${key}/${lexicon{${key}}}/ig;
}
print "sample after : $sample\n";
The output:
sample before: The due date of the assignment is a date in the fall.
sample after : The Date de Remise of the Devoir is a Date in the Automne.
The next challenge is going to be ensuring that the case of the replacement matches what's being replaced.
Use \b to detect word boundary instead of \w to detect whitespace.
Combine the solution of Steven Klassen with
How to replace a set of search/replace pairs?
#!/usr/bin/perl
use strict;
use warnings;
my %lexicon = (
'Winter' => 'Hiver',
'Date' => 'Date',
'Due Date' => 'Date de Remise',
'Problem' => 'Problème',
'Summer' => 'Été',
'Mark' => 'Point',
'Fall' => 'Automne',
'Assignment' => 'Devoir',
'November' => 'Novembre',
);
# add lowercase
for (keys %lexicon) {
$lexicon{lc($_)} = lc($lexicon{$_});
print $_ . " " . $lexicon{lc($_)} . "\n";
}
# Combine to one big regexp.
# https://stackoverflow.com/questions/17596917/how-to-replace-a-set-of-search-replace-pairs?answertab=votes#tab-top
my $regexp = join '|', map { "\\b$_\\b" } keys %lexicon;
my $sample = 'The due date of the assignment is a date in the fall.';
print "sample before: $sample\n";
$sample =~ s/($regexp)/$lexicon{$1}/g;
print "sample after : $sample\n";

Specify number of matching regex groups using Perl

Let's say I have the following string
my $val = "3.4 -22.352 4.0"
The goal is to extract each decimal number by itself. There can be any number of spaces on each side or in between. It is also important to make sure that there is exactly 3 numbers present, and no other junk. I have something like this, but it doesn't work:
my #parts = ($val =~ /((\s*[-+]?\d{1,3}\.\d{1,3}\s*)){3}/)
if (scalar(#parts) == 3) {
print "Validated!\n";
for my $i (#parts) {
print "$i\n";
}
}
For some reason I get the last one twice.
Each capturing group gets you only one value, even if you apply a quantifier on it. If you want 3 values you have to repeat the capturing group 3 times. For example:
my $num = qr/[-+]?\d{1,3}\.\d{1,3}/;
my #nums = $val =~ /^\s*($num)\s+($num)\s+($num)\s*$/;
if(#nums){
print "Valid, and no need to check the number of elements.\n";
}
Instead of fighting regular expressions, use split and looks_like_number:
use warnings;
use strict;
use Scalar::Util qw(looks_like_number);
my $val = "3.4 -22.352 4.0";
my #parts = split /\s+/, $val;
if (scalar(#parts) == 3) {
my $ok = 0;
for (#parts) {
$ok++ if looks_like_number($_);
}
if ($ok == 3) {
print "Validated!\n";
for my $i (#parts) {
print "$i\n";
}
}
}
There are several issues here:
1) If you want three and only three numbers, you should anchor the start (^) and end ($) of the line in the regex.
2) Why are there two sets of parentheses? As written the second pair are redundant.
3) When you have a regex, the number of values returned are usually counted by the left parentheses (unless you use ?: or some other modifier). In this example, you have two, so it only returns two values. Because of the redundant parentheses, you get the same values twice each.
You have two sets of parens, so two values are returned. Both sets surround the same part of the regex, so both values will be the same.
Validating and extracting at not necessarily possible to do at the same time.
Doing it in two steps, extracting first, is quite simple:
my #nums = split ' ', $val;
die "Invalid\n" if #parts != 3;
for (#nums) {
die "Invalid\n" if !/^[-+]?[0-9]{1,3}\.[0-9]{1,3}\z/;
}
You can do it in one step, but there's some redundancy involved:
my $num_pat = qr/[-+]?[0-9]{1,3}\.[0-9]{1,3}/;
my #nums = $val =~ /^($num_pat)\s+($num_pat)\s+($num_pat)\z/
or die "Invalid\n";
my $val = "3.4 -22.352 4.0";
my $length = $val =~ s/((^|\s)\S)/$1/g;
#determines the number of tokens
if ($length == 3)
{
while($val=~/([-+]?[0-9]{1,3}\.[0-9]{1,3})/g)
{
print "$1\n";
}
}
The /g allows you to loop through the string and extract values conforming to your restrictions (one at a time). It will do this until all of the "tokens" matching your pattern are iterated through. I like this solution because it's concise and doesn't require you to create an auxiliary array. It's also a more general answer than using three extractions in one's regex.
With Regex Only
This will require 3 chunks of numbers delimited by space each number will be popluated into it's respective group.
(?:(?:^)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=$))
Example
PHP Code Example:
<?php
$sourcestring="3.4 -22.352 4.0";
preg_match_all('/(?:(?:^)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=$))/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => 3.4 -22.352 4.0
)
[1] => Array
(
[0] => 3.4
)
[2] => Array
(
[0] => -22.352
)
[3] => Array
(
[0] => 4.0
)
)

Perl regex with a negative lookahead behaves unexpectedly

I'm attempting to match /ezmlm-(any word except 'weed' or 'return')\s+/ with a regex. The following demonstrates a foreach loop which does the right thing, and an attempted regex which almost does:
#!/usr/bin/perl
use strict;
use warnings;
my #tests = (
{ msg => "want 'yes', string has ezmlm, but not weed or return",
str => q[|/usr/local/bin/ezmlm-reject '<snip>'],
},
{ msg => "want 'yes', array has ezmlm, but not weed or return",
str => [ <DATA> ],
},
{ msg => "want 'no' , has ezmlm-weed",
str => q[|/usr/local/bin/ezmlm-weed '<snip>'],
},
{ msg => "want 'no' , doesn't have ezmlm-anything",
str => q[|/usr/local/bin/else '<snip>'],
},
{ msg => "want 'no' , ezmlm email pattern",
str => q[crazy/but/legal/ezmlm-wacky#example.org],
},
);
print "foreach regex\n";
foreach ( #tests ) {
print doit_fe( ref $_->{str} ? #{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t";
print doit_re( ref $_->{str} ? #{$_->{str}} : $_->{str} ) ? "yes" : "no";
print "\t<--- $_->{msg}\n";
};
# for both of the following subs:
# #_ will contain one or more lines of data
# match the pattern /ezmlm-(any word except 'weed' or 'return')\s+/
sub doit_fe {
my $has_ezmlm = 0;
foreach ( #_ ) {
next if $_ !~ m/ezmlm-(.*?)\s/;
return 0 if $1 eq 'weed' or $1 eq 'return';
$has_ezmlm++;
};
return $has_ezmlm;
};
sub doit_re { return grep /ezmlm-(?!weed|return)/, #_; };
__DATA__
|/usr/local/bin/ezmlm-reject '<snip>'
|/usr/local/bin/ezmlm-issubn '<snip>'
|/usr/local/bin/ezmlm-send '<snip>'
|/usr/local/bin/ezmlm-archive '<snip>'
|/usr/local/bin/ezmlm-warn '<snip>'
The output of the sample program is as follows:
foreach regex
yes yes <--- want 'yes', string has ezmlm, but not weed or return
yes yes <--- want 'yes', array has ezmlm, but not weed or return
no no <--- want 'no' , has ezmlm-weed
no no <--- want 'no' , doesn't have ezmlm-anything
no yes <--- want 'no' , ezmlm email pattern
In the last instance, the regex fails, matching a goofy but legal email address. If I amend the regex placing a \s after the negative lookahead pattern like so:
grep /ezmlm-(?!weed|return)\s+/
The regex fails to match at all. I'm supposing it has to do with the how the negative pattern works. I've tried making the negation non-greedy, but it seems there's some lesson buried in 'perldoc perlre' that is escaping me. Is it possible to do this with a single regex?
The negative look-ahead is zero-width which means that the regex
/ezmlm-(?!weed|return)\s+/
will only match if one or more space characters immediately follow "ezmlm-".
The pattern
/ezmlm-(?!weed|return)/
will match
"crazy/but/legal/ezmlm-wacky#example.org"
because it contains "ezmlm-" not followed by "weedy" or "return".
Try
/ezmlm-(?!weed|return)\S+\s+/
where \S+ is one or more non-space characters (or instead use [^#\s]+ if you want to deny email addresses even if followed by a space).

Perl 5 - longest token matching in regexp (using alternation)

Is possible to force a Perl 5 regexp match longest possible string, if the regexp is, for example:
a|aa|aaa
I found is probably default in perl 6, but in perl 5, how i can get this behavior?
EXAMPLE pattern:
[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]
If I have string 2.10.2014, then first match will be 2, which is ok; but the next match will be 1, and this is not ok because it should be 10. Then 2014 will be 4 subsequently matches 2,0,1,4, but it should be 2014 using [0-9][0-9][0-9][0-9]. I know I could use [0-9]+, but I can't.
General solution: Put the longest one first.
my ($longest) = /(aaa|aa|a)/
Specific solution: Use
my ($longest) = /([0-9]{4}|[0-9]{1,2})/
If you can't edit the pattern, you'll have to find every possibility and find the longest of them.
my $longest;
while (/([0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9])/g) {
$longest = $1 if length($1) > length($longest);
}
The sanest solution I can see for unknown patterns is to match every possible pattern, look at the length of the matched substrings and select the longest substring:
my #patterns = (qr/a/, qr/a(a)/, qr/b/, qr/aaa/);
my $string = "aaa";
my #substrings = map {$string =~ /($_)/; $1 // ()} #patterns;
say "Matched these substrings:";
say for #substrings;
my $longest_token = (sort { length $b <=> length $a } #substrings)[0];
say "Longest token was: $longest_token";
Output:
Matched these substrings:
a
aa
aaa
Longest token was: aaa
For known patterns, one would sort them manually so that first-match is the same as longest-match:
"aaa" =~ /(aaa|aa|b|a)/;
say "I know that this was the longest substring: $1";
The alternation will use the first alternative that matches, so just write /aaa|aa|a/ instead.
For the example you have shown in your question, just put the longest alternative first like I said:
[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]
perl -Mstrict -Mre=/xp -MData::Dumper -wE'
{package Data::Dumper;our($Indent,$Sortkeys,$Terse,$Useqq)=(1)x4}
sub _dump { Dumper(shift) =~ s{(\[.*?\])}{$1=~s/\s+/ /gr}srge }
my ($count, %RS);
my $s= "aaaabbaaaaabbab";
$s =~ m{ \G a+b? (?{ $RS{ $+[0] - $-[0] } //= [ ${^MATCH}, $-[0] ]; $count++ }) (*FAIL) };
say sprintf "RS: %s", _dump(\%RS);
say sprintf "count: %s", $count;
'
RS: {
"1" => [ "a", 0 ],
"2" => [ "aa", 0 ],
"3" => [ "aaa", 0 ],
"4" => [ "aaaa", 0 ],
"5" => [ "aaaab", 0 ]
}
count: 5