Search for String in File Using Regular Expression Perl - regex

I'm new to Perl, I'm reading text from a file and want to REPLACE some words with their translation in French. I managed to get word by word, but not by expression/string, I'm having problems getting it code wise.
Code for word by word:
my $filename = 'assign3.txt';
my #lexicon_en = ("Winter","Date", "Due Date", "Problem", "Summer","Mark","Fall","Assignment","November");
my #lexicon_fr = ("Hiver", "Date", "Date de Remise","Problème","Été", "Point", "Automne", "Devoir", "Novembre");
my $i=1;
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file $filename !";
while (<$fh>) {
for my $word (split)
{
print " $i. $word \n";
$i++;
for (my $j=0; $j < 9;$j++){
if ($word eq $lexicon_en[$j]){
print "Found one! - j value is $j\n";
}
}
}
}
print "\ndone here!!\n";
Here is the regular expression I'm trying to use:
/\w+\s\w+/
This is my code for strings:
while (<>) {
print ("this is text: $_ \n");
if ((split (/Due\sDate/),$_) eq "Due Date"){
print "yes!!\n";
}
}

I think I understand the challenge you're having. Because "due date" is two words you need it to match before "due" matches otherwise you get several incorrect translations. One way to deal with that would be to order your matches by the largest number of words to the fewest so that "due date" is dealt with before "due".
If you convert your arrays to a hash (dictionary) you can order the keys based on the number of spaces and then iterate over them to do the actual substitutions:
#!/usr/bin/perl
use strict;
use warnings;
#my #lexicon_en = ("Winter","Date", "Due Date", "Problem", "Summer","Mark","Fall","Assignment","November");
#my #lexicon_fr = ("Hiver", "Date", "Date de Remise","Problème","Été", "Point", "Automne", "Devoir", "Novembre");
# convert your arrays to a hash
my %lexicon = (
'Winter' => 'Hiver',
'Date' => 'Date',
'Due Date' => 'Date de Remise',
'Problem' => 'Problème',
'Summer' => 'Été',
'Mark' => 'Point',
'Fall' => 'Automne',
'Assignment' => 'Devoir',
'November' => 'Novembre',
);
# sort the keys on the number of spaces found
my #ordered_keys = sort { ($a =~ / /g) < ($b =~ / /g) } keys %lexicon;
my $sample = 'The due date of the assignment is a date in the fall.';
print "sample before: $sample\n";
foreach my $key (#ordered_keys) {
$sample =~ s/${key}/${lexicon{${key}}}/ig;
}
print "sample after : $sample\n";
The output:
sample before: The due date of the assignment is a date in the fall.
sample after : The Date de Remise of the Devoir is a Date in the Automne.
The next challenge is going to be ensuring that the case of the replacement matches what's being replaced.

Use \b to detect word boundary instead of \w to detect whitespace.
Combine the solution of Steven Klassen with
How to replace a set of search/replace pairs?
#!/usr/bin/perl
use strict;
use warnings;
my %lexicon = (
'Winter' => 'Hiver',
'Date' => 'Date',
'Due Date' => 'Date de Remise',
'Problem' => 'Problème',
'Summer' => 'Été',
'Mark' => 'Point',
'Fall' => 'Automne',
'Assignment' => 'Devoir',
'November' => 'Novembre',
);
# add lowercase
for (keys %lexicon) {
$lexicon{lc($_)} = lc($lexicon{$_});
print $_ . " " . $lexicon{lc($_)} . "\n";
}
# Combine to one big regexp.
# https://stackoverflow.com/questions/17596917/how-to-replace-a-set-of-search-replace-pairs?answertab=votes#tab-top
my $regexp = join '|', map { "\\b$_\\b" } keys %lexicon;
my $sample = 'The due date of the assignment is a date in the fall.';
print "sample before: $sample\n";
$sample =~ s/($regexp)/$lexicon{$1}/g;
print "sample after : $sample\n";

Related

How can I translate non printable ascii chars to readable text with Perl

I'm trying to test some probes connected via USB on an Linux device using Perl 5.28 and Linux (Debian 8). When I read out a large file buffer of the probe, often none readable ASCII signs occur like \0 or \x02. I want to translate these signs into readable tagged text. I've written a small subroutine, but it seems to me a little bit clunky for large translation list to test every entry. Is there a better way to do that?
Example script
#!/usr/bin/env perl -w
# test-escape.pl --- test none readable chars
use strict;
sub escBuf() {
my $buf = shift;
my #numNul = $buf =~ /\0/g;
my #numCR = $buf =~ /\r/g;
$buf =~ s/\r/\n/g;
$buf =~ s/\x00/<NUL>/g;
$buf =~ s/\x01/<SOH>/g;
$buf =~ s/\x02/<STX>/g;
$buf =~ s/\x03/<ETX>/g;
$buf =~ s/\x04/<EOT>/g;
$buf =~ s/\x05/<ENQ>/g;
$buf =~ s/\x06/<ACK>/g;
$buf =~ s/\x07/<BEL>/g;
$buf =~ s/\x08/<BS>/g;
$buf =~ s/\x0B/<VT>/g;
$buf =~ s/\x0C/<FF>/g;
$buf =~ s/\x0E/<SO>/g;
$buf =~ s/\x0F/<SI>/g;
my $numNUL = #numNul;
my $numCR = #numCR;
return ($buf, $numNUL, $numCR);
}
# Buffer example
my $buffer = "\x01\r\x02This is a test with\r\n ".
"sometimes qiurks \0 inside \x0C stuff \0 and regular \x03\r\x04";
# Translate output
my ($out, $numNUL, $numCR) = &escBuf($buffer);
# Not printed correctly due to \0
# print "ORG.TEXT: '$buffer' \n\n";
# Result of the translation
print "ESC.TEXT: '$out' \n\n";
print "NUM.NUL: $numNUL\n";
print "NUM.CR: $numCR\n\n";
Result
/usr/bin/env perl -w "test-escape.pl"
ESC.TEXT: '<SOH>
<STX>This is a test with
sometimes qiurks <NUL> inside <FF> stuff <NUL> and regular <ETX>
<EOT>'
NUM.NUL: 2
NUM.CR: 3
EDIT: Adopted code with proposed solution by ikegami
#!/usr/bin/env perl -w
# test-escape.pl --- test none readable chars
use strict;
# Dictionary of non printable signs
my %NONE_ASC_DICT = (
"\x00" => "NUL", "\x01" => "SOH", "\x02" => "STX", "\x03" => "ETX",
"\x04" => "EOT", "\x05" => "ENQ", "\x06" => "ACK", "\x07" => "BEL",
"\x08" => "BS",
# Essenital for parsing "\x09" => "TAB" "\x0a" => "LF"
"\x0b" => "VT", "\x0c" => "FF", "\x0d" => "CR",
"\x0e" => "SO", "\x0f" => "SI",
"\x10" => "DLE",
"\x11" => "DC1", "\x12" => "DC2", "\x13" => "DC3", "\x14" => "DC4",
"\x15" => "NAK", "\x16" => "SYN", "\x17" => "ETB", "\x18" => "CAN",
"\x19" => "EM", "\x1A" => "SUB", "\x1B" => "ESC", "\x1C" => "FS",
"\x1D" => "GS", "\x1E" => "RS", "\x1F" => "US", "\x7F" => "DEL",
);
# Mapping of the entries and corresponding predefined REGEX
my $NONE_ASC_CLASS = join "", map quotemeta, keys(%NONE_ASC_DICT);
my $NONE_ASC_REGEX = qr/([$NONE_ASC_CLASS])/;
# Translator subroutine
sub escBuffer() {
my ($buf, $dict, $regex, $prefix, $suffix) = #_;
# Set default sprefix suffix strings if not present
$prefix //= '<'; $suffix //= '>';
# Count the real quirks
my #numNUL = $buf =~ /\0/g;
my $numNUL = #numNUL;
# Clean up mixed UNIX / DOS context
$buf =~ s/\r\n/\n/g;
$buf =~ s/\r/\n/g; # translate all remaining \r to \n
# Calc resulting number of lines
my #numLF = $buf =~ /\n/g;
my $numLF = #numLF;
# Translate the remaining non printables
$buf =~ s/$regex/ $prefix.$dict->{$1}.$suffix /eg;
# Result set translated buffer, count quirks, count lines
return ($buf, $numNUL, $numCR);
}
# Buffer example
my $buffer = "\x01\r\x02This is a test with\r\n ".
"sometimes qiurks \0 inside \x0C stuff \0 and regular \x03\r\x04";
# Translate output
my ($out, $numNUL, $numLF) = &escBuffer
($buffer, \%NONE_ASC_DICT, $NONE_ASC_REGEX);
# Result of the translation
print "ESC.TEXT: '$out' \n\n";
print "NUM.NUL: $numNUL\n";
print "NUM.LF: $numLF\n\n";
Use a table.
Setup:
my %map = (
"\x00" => "<NUL>",
...,
);
my $class = join "", map quotemeta, keys(%map);
my $re = qr/([$class])/;
Replacing:
s/$re/$map{$1}/g

Get all occurances of matches pattern perl

if we are in the following case:
my $str = <<EO_STR;
Name=Value1 Adress=Value4
Name=Value2 Adress=Value5
Name=Value3 Adress=Value6
EO_STR
I have a table "T1" in the database with columns: ("Name", "Address") and I want to put on the column "Name" values "value1,Value2,Value3" and on the column "Adress" values "Value4,Value5,Value6"
in this case we have :
my #matches = $str =~ /Name=(.*?)\nAdress=(.*?)\n/g;
how can we use $1 and $2 with #matches in order to get separately all occurence of Name and Adresse in order to insert them on the Table T1?
All captures of all matches are returned, so you'd have to group them up.
use List::Util 1.29 qw( pairs );
for ( pairs( $str =~ /Name=(.*) Address=(.*)/g ) ) {
my #matches = #$_;
...
}
That said, it's far more common to grab the matches iteratively.
while ($str =~ /Name=(.*) Address=(.*)/g) {
my #matches = ( $1, $2 );
...
}
Regex is not always the right tool for the job. Your data looks a lot like it's just key/value pairs. Use split to break it up. No need for a pattern match here.
Your code and data doesn't match, so I've gone with what the code said.
use strict;
use warnings;
my $str = <<EO_STR;
Name=Value1
Adress=Value4
Name=Value2
Adress=Value5
Name=Value3
Adress=Value6
EO_STR
my $fields;
foreach my $pair (split /\n/, $str) {
my ($key, $value) = split /=/, $pair;
$key =~ s/^\s+//;
push #{ $fields->{$key} }, $value;
}
use Data::Dumper;
print Dumper $fields;
The code will create this data structure:
$VAR1 = {
'Name' => [
'Value1',
'Value2',
'Value3'
],
'Adress' => [
'Value4',
'Value5',
'Value6'
]
};
You can now access these two array references and use them to insert data into your table.
I have done the following:
#!/usr/bin/env perl
use v5.28;
my $str = <<EO_STR;
Name=Value1 Adress=Value4
Name=Value2 Adress=Value5
Name=Value3 Adress=Value6
EO_STR
my #array;
for my $a (split(/\n/, $str)) {
my %res = $a =~ m/(\w+)=(\w+)/g;
push #array, \%res;
}
for my $a (#array) {
for my $b (sort keys %{$a}) {
"\n", <INPUT_FILE> ); say $b.'->'.$a->{$b};
}
}
It creates this structure:
#array = [
{
Name->Value1,
Adress->Value4
},
...
];

Perl - Using regex to match input in hash key or value

First, this is a homework assignment. I am having a tough time with regex, and I'm stuck.
This is the code I have so far, where I have the user designate a filename, and if it exists, populates a hash of the names as keys, and the phone numbers as the values.
#!/usr/bin/perl
use strict;
print "\nEnter Filename: ";
my $file = <STDIN>;
chomp $file;
if(!open(my $fileName, "<", "$file"))
{
print "Sorry, that file doesn't exist!", "\n";
}
else
{
my %phoneNums;
while (my $line=<$fileName>)
{
chomp($line);
(my $name,my $number) = split /:/, $line;
$phoneNums{$name} = $number;
}
print "Read in the file!", "\n\n";
print "Enter search: ";
my $input = <STDIN>;
chomp $input;
#HERE IS WHERE I'M LOST
}
print "\n";
This is the part I am stuck on:
Allow the user to enter a search string.
Look for matches using the same style as the phone. Any individual
character in the search string can match any other character from the
key, meaning a ‘2’ in the search string can match a ‘2’, ‘A’, ‘B’, or ‘C’ in the contact list. Matches can occur in the contact name or the phone number. For a match to occur, each character in the search string must appear, in order, in the contact info, but not necessarily next to each
other. For example, a search string of “86” (essentially the same as a search string of “TM” or “NU”) would match “TOM” but not “MOTHER”.
Characters on each phone keys:
0,
1,
2ABC,
3DEF,
4GHI,
5JKL,
6MNO,
7PQRS,
8TUV,
9WXYZ
I just am stuck on how exactly to make all those character classes, and any help at all is much appreciated.
The way to tackle this is by writing a function that reduces your 'things' to their common components. The best way to do this IMO is use a hash:
my %num_to_letter = (
0 => [],
1 => [],
2 => [ "A", "B", "C" ],
3 => [ "D", "E", "F" ],
4 => [ "G", "H", "I" ],
5 => [ "J", "K", "L" ],
## etc.
);
my %letter_to_num;
foreach my $key ( keys %num_to_letter ) {
foreach my $element ( #{$num_to_letter{$key}} ) {
$letter_to_num{lc($element)} = lc($key);
}
}
print Dumper \%letter_to_num;
This creates a map of which letters or numbers map to their original - a bit like this:
$VAR1 = {
'b' => '2',
'g' => '4',
'e' => '3',
'i' => '4',
'a' => '2',
'j' => '5',
...
Note - you can do this by hand, but I prefer to generate from the top map, because I think it looks neater. Note - we use lc to lower case everything, so this becomes case insensitive. It's probably worth looking at fc - which is a similar tool but handles international characters. (Not relevant in this example though)
You then 'reduce' both search and 'target' to their common values:
sub normalise {
my ( $input ) = #_;
#join with no delimiter.
return join ( '',
#look up $_ (each letter) in $letter_to_num
#if not present, use // operator to return original value.
#this means we get to turn numbers into letters,
#but leave things that are already numbers untouched.
map { $letter_to_num{lc($_)} // $_ }
#split the input line into characters.
split ( //, $input )
);
}
print normalise ("DAD"),"\n"; ## 323
And then compare one against the other:
my $search = "DAD";
my $normalised_search = normalise($search);
print "Searching for: \"$normalised_search\"\n";
my $number_to_match = '00533932388';
my $string_to_match = "daddyo";
print "Matches number\n"
if normalise($number_to_match) =~ m/$normalised_search/;
print "Matches string\n"
if normalise($string_to_match) =~ m/$normalised_search/;
Here's an almost procedural approach that cheats a bit by using Hash::MultiValue:
use Hash::MultiValue; # makes reversing and flattening easier
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hash::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# TOM in numbers - 866 in letters
my $letters = "TOM" ;
print join '', $let2num{$_} // $_ for (split('', $letters)), "\n";
my $phone_input = "866" ;
print join '', #{$num2let{$_}}," " for (split('', $phone_input)) , "\n";
Output:
866
TUV MNO MNO
So here "TOM" would overlap with "UNO" ... I like #Sobrique's answer :-)
To search an array/list of contact names using the phone keypad input we can create a hash containing keys and values of the names and their number equivalents and then match the "converted" name value against the input:
use Hash::MultiValue; # makes reversing and flattening easier
my #contacts = <DATA> ;
chomp #contacts;
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hasj::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# create key/value pairs for contact database
my %contacts2nums ;
for $contact (#contacts) {
$contacts2nums{$contact} = join "",
map { $let2num{$_} } split('', uc $contact);
}
my $phone_input = "866";
for my $contact (keys %contacts2nums) {
print "The text: \"$contact\" matches the input: \"$phone_input\" \n"
if $phone_input eq $contacts2nums{$contact};
}
__DATA__
Tom
Mother
TIMTOWDI
DAD
Gordon
Output:
The text: "Tom" matches the input: "866"
A more organized approach would wrap the conversion operation in a function.
Addendum:
With a real keypad you could probably come up with a simple algorithm that could be more deterministic regarding the letter you want to associate with the number on the keypad. You could iterate through the array based on number of presses of the key: e.g. two presses on "2" would be equal to "B", etc. You'd just have to figure out how/when to move to the next character with some kind of timeout/wait value. This way you would have a more exact string on which to base your search.

Perl: Using split but ignore quotes

I'm trying to create a Perl hash from an input string, but I'm having problems with the original 'split', as values may contain quotes. Below is an example input string, and my (desired) resulting hash:
my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,MOB,123,KEY,VALUE":TIME,"08:01:59":FIN,0';
my %hash =
(
CREATE => '',
USER => '',
TEL => '12345678',
MOB => '444001122',
Type => 'Whatever',
ATTRIBUTES => 'ID,0,MOB,123,KEY,VALUE',
TIME => '08:01:59',
FIN => '0',
);
The input string is of arbitrary length, and the number of keys is not set.
Thanks!
-hq
Use Text::CSV. It handles comma separated value files correctly.
Update
It seems the format of your input is not parsable by the standard module, even with sep_char and allow_loose_quotes. So, you have to do the heavy lifting yourself, but you can still use Text::CSV to parse each key-value pair:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw(say);
use Data::Dumper;
use Text::CSV;
my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0';
my #fields = split /:/, $command;
my %hash;
my $csv = Text::CSV->new();
my $i = 0;
while ($i <= $#fields) {
if (1 == $fields[$i] =~ y/"//) {
my $j = $i;
$fields[$i] .= ':' . $fields[$j] until 1 == $fields[++$j] =~ y/"//;
$fields[$i] .= ':' . $fields[$j];
splice #fields, $i + 1, $j - $i, ();
}
$csv->parse($fields[$i]);
my ($key, $value) = $csv->fields;
$hash{$key} = "$value"; # quotes turn undef to q()
$i++;
}
print Dumper \%hash;
As far as I can see the most obvious candidate - Text::CSV - won't handle this format properly, so a home-grown regular expression solution is the only one.
use strict;
use warnings;
my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0';
my %config;
for my $field ($command =~ /(?:"[^"]*"|[^:])+/g) {
my ($key, $val) = split /,/, $field, 2;
($config{$key} = $val // '') =~ s/"([^"]*)"/$1/;
}
use Data::Dumper;
print Data::Dumper->Dump([\%config], ['*config']);
output
%config = (
'TIME' => '08:01:59',
'MOB' => '444001122',
'Type' => 'Whatever',
'CREATE' => '',
'TEL' => '12345678',
'ATTRIBUTES' => 'ID,0,KEY,VALUE',
'USER' => '',
'FIN' => '0'
);
If you have Perl v5.10 or later then you have the convenient (?| ... ) regular expression group, which allows you to write this
use 5.010;
use warnings;
my $command = 'CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0';
my %config = $command =~ /(\w+) (?| , " ([^"]*) " | , ([^:"]*) | () )/gx;
use Data::Dumper;
print Data::Dumper->Dump([\%config], ['*config']);
which produces identical results to the code above.
This looks like something Text::ParseWords could handle. The quotewords subroutine will split the input on the delimiter :, ignoring delimiters inside quotes. This will give us the basic list of items, seen first in the output as $VAR1. After that, it is a simple matter of parsing the comma separated items with a regex which will handle optional second capture to accommodate empty tags such as those for CREATE and USER.
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
while (<DATA>) {
chomp;
my #list = quotewords(':', 0, $_);
my %hash = map { my ($k, $v) = /([^,]+),?(.*)/; $k => $v; } #list;
print Dumper \#list, \%hash;
}
__DATA__
CREATE:USER:TEL,12345678:MOB,444001122:Type,Whatever:ATTRIBUTES,"ID,0,KEY,VALUE":TIME,"08:01:59":FIN,0
Output:
$VAR1 = [
'CREATE',
'USER',
'TEL,12345678',
'MOB,444001122',
'Type,Whatever',
'ATTRIBUTES,ID,0,KEY,VALUE',
'TIME,08:01:59',
'FIN,0'
];
$VAR2 = {
'TIME' => '08:01:59',
'MOB' => '444001122',
'Type' => 'Whatever',
'CREATE' => '',
'TEL' => '12345678',
'ATTRIBUTES' => 'ID,0,KEY,VALUE',
'USER' => '',
'FIN' => '0'
};
my %hash = $command =~ /([^:,]+)(?:,((?:[^:"]|"[^"]*")*))?/g;
s/"([^"]*)"/$1/g
for grep defined, values %hash;

Match count in Perl

I have a hash with various keywords. Now, I want to find the count of these keywords in the string.
I just wrote some part of code with a foreach loop.
use strict;
use warnings;
my $string = "The invitro experiments are conducted on human liver microsom. "
. " These liver microsom can be cultured in rats.";
my %hash = (
"human" => 1,
"liver" => 1,
"microsom" => 1,
);
for my $nme (keys %hash){
# Some code which I am not sure
}
Expected output: human:1; liver:2; microsom:3
Can someone help me in this?
Thanks
The following snippet should suffice.
#!/usr/bin/perl -w
use strict;
my $string="The invitro experiments are conducted on human liver microsom. These liver microsom can be cultured in rats.";
my %hash = (
'human' => 1,
'liver' => 1,
'microsom' => 1,
);
my #words = split /\b/, $string;
my %seen;
for (#words) {
if ($_ eq 'human' or $_ eq 'liver' or $_ eq 'microsom') {
$seen{$_}++;
}
}
for (keys %hash) {
print "$_: $seen{$_}\n";
}
Is that homework? :) Well, depends on number of words in hash, and number of words in string ( strings ), it will be better iterate over hash, or iterate over words in string(s), incrementing appropriate values when found. As you need to check all the words, you will end with a list with some marked "0" occurences, and some marked more than zero.
probably not the best way of going about this, but it should work.
my $string = "The invitro experiments are conducted on human liver microsom. These liver microsom can be cultured in rats.";
my %hash = (
'human' => 1,
'liver' => 1,
'microsom' => 1,
);
foreach my $nme (keys %hash){
$hash{$nme} = scalar #{[$string =~ /$nme/g]};
print "$hash{$nme}\n";
}