Perl newbie: trying to find string in array of strings - regex

I need to match a string against an array of strings. The string that I am searching for should be able to contain wildcards.
#!/usr/bin/perl
#
## disable buffered I/O which would lead
## to deadloops for the Apache server
$| = 1;
#
## read URLs one per line from stdin
while (<>) {
my $line = $_;
my #array1 = ("abc","def","ghi");
$found = 0;
if (/$line/i ~~ #array1)
{
print "found\n";
}
else
{
print "not found\n";
}
}
I test this script with the input of abc and it returns not found
perl ./mapscript.pl
abc
not found

Your input has a newline at the end. Add:
chomp $line;
right after
my $line = $_;

Use chomp(my $input = $_) to remove newline instead of my $input = $_ inside your while..
** OOPs.. Didn't see that I'm posting Duplicate..

a newline at the end always exists using <>. see chomp

Related

Could anybody point out the mistake I am doing in this perl program?

As per the below code I should get bibhu printed once and j should be 1 if I enter "bibhu" at runtime. But It's not happening. Why am I getting j=0?
print "enter\n";
$find=<STDIN>;
$j=0;
#lines=qw(bibhu prasanna behera kuni shun jbjdkj);
foreach (#lines) {
if ($_ =~ /$find/) {
print "$_\n";
$j=$j+1;
}
}
print "$j\n";
You need to strip the newline from the input, just call chomp:
chomp($find = <STDIN>);
Without that, if you enter "bibhu" at the prompt, $find will be equal to "bibhu\n".
Here a slightly improved version of your script - as Lucas T. has written, you need to remove the newline at the end of entered string with chomp:
#!/usr/bin/perl -w
use strict;
print "enter\n";
chomp(my $find=(<STDIN>));
my $j=0;
my #lines=qw(bibhu prasanna behera kuni shun jbjdkj);
foreach (#lines) {
if (/$find/o) {
print "$_\n";
$j=$j+1;
}
}
print "$j\n";
I am thinking the other way round: You can add a \n to your $_ in the if:
...
if ("$_\n" =~ /$find/) {
...
The reason is your input contain with the newline character. So you should remove the newline from input or your regex to allow the newline.
Remove the newline from a input use chomp as mentioned other answer.
The other way is you should use x flag flag for to allow the whitespace or newline in your regex.
print "enter\n";
$find=<STDIN>;
$j=0;
#lines=qw(bibhu prasanna behera kuni shun jbjdkj);
foreach (#lines) {
if ($_ =~ /$find/x) {
print "$_\n";
$j=$j+1;
}
}
print "$j\n";
But better you use chomp

Using iterated variables with regex

The point of the overall script is to:
step 1) open a single column file and read off first entry.
step 2) open a second file containing lots of rows and columns, read off EACH line one at a time, and find anything in that line that matches the first entry from the first file.
step3) if a match is found, then "do something constructive", and if not, go to the first file and take the second entry and repeat step 2 and step 3, and so on...
here is the script:
#!/usr/bin/perl
use strict; #use warnings;
unless(#ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = $ARGV[0];
chomp( $list );
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
my( #list ) = (<LIST>);
close LIST;
open (CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my(#spreadsheet) = (<CHR1>);
close CHR1;
for (my $i = 0; $i < scalar #list; $i++ ) {
print "$i in list is $list[$i]\n";
for (my $j = 1; $j < scalar #spreadsheet; $j++ ) {
#print "$spreadsheet[$j]\n";
if ( $spreadsheet[$j] ) {
print "will $list[$i] match with $spreadsheet[$j]?\n";
}
else { print "no match\n" };
} #for
} #for
I plan to use a regex in the line if ( $spreadsheet[$j] ) { but am having a problem at this step as it is now. On the first interation, the line print "will $list[$i] match with $spreadsheet[$j]?\n"; prints $list[$i] OK but does not print $spreadsheet[$j]. This line will print both variables correctly on the second and following iterations. I do not see why?
At first glance nothing looks overtly incorrect. As mentioned in the comments the $j = 1 looks questionable but perhaps you are skipping the first row on purpose.
Here is a more perlish starting point that is tested. If it does not work then you have something going on with your input files.
Note the extended trailing whitespace removal. Sometimes if you open a WINDOWS file on a UNIX machine and use chomp, you can have embedded \r in your text that causes weird things to happen to printed output.
#!/usr/bin/perl
use strict; #use warnings;
unless(#ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = shift;
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
open(CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my #spreadsheet = map { s/\s+$//; $_ } <CHR1>;
close CHR1;
# s/\s+$//; is like chomp but trims all trailing whitespace even
# WINDOWS files opened on a UNIX system.
for my $item (<LIST>) {
$item =~ s/\s+$//; # trim all trailing whitespace
print "==> processing '$item'\n";
for my $row (#spreadsheet) {
if ($row =~ /\Q$item\E/) { # see perlre for \Q \E
print "match '$row'\n";
}
else {
print "no match '$row'\n";
}
}
}
close LIST;

Perl Regex match works, but replace does not

I have put together a Perl script to go through a directory and match various keys in the source and output the results to a text file. The match operation works well, however the end goal is to perform a replace operation. The Perl script is as follows:
#!/usr/bin/perl
#use strict;
use warnings;
#use File::Slurp;
#declare variables
my $file = '';
my $verbose = 0;
my $logfile;
my #files = grep {/[.](pas|cmm|ptd|pro)$/i} glob 'C:\users\perry_m\desktop\epic_test\pascal_code\*.*';
#iterate through the files in input directory
foreach $file (#files) {
print "$file\n";
#read the file into a single string
open FILEHANDLE, $file or die $!;
my $string = do { local $/; <FILEHANDLE> };
#perfrom REGEX on this string
########################################################
#fix the include formats to conform to normal PASCAL
$count = 0;
while ($string =~ m/%INCLUDE/g)
{
#%include
$count++;
}
if ($count > 0)
{
print " $count %INCLUDE\n";
}
$count = 0;
while ($string =~ m/INCLUDE/g)
{
#%INCLUDE;
$count++;
}
if ($count > 0)
{
print " $count INCLUDE\n";
}
$count = 0;
while ($string =~ m/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/g)
{
#$1$2;
$count++;
}
if ($count > 0)
{
print " $count XXXX:include \n";
}
}
This produces output as desired, an example is below:
C:\users\perry_m\desktop\epic_test\pascal_code\BRTINIT.PAS
1 INCLUDE
2 XXXX:include
39 external and readonly
However if I change the regex operations to try and implement a replace, using the replacement operation shown in the commented lines above, the scripts hangs and never returns. I imagine it is somehow related to memory, but I am new to Perl. I was also trying to avoid parsing the file by line if possible.
Example:
while ($string =~ s/%INCLUDE/%include/g)
{
#%include
$count++;
}
and
while ($string =~ s/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/$1$2;/g)
{
#$1$2;
$count++;
}
Edit: simplified the examples
The problem is with your while loops. A loop like
while ($string =~ m/INCLUDE/g) { ... }
will execute once for each ocurrence of INCLUDE in the target string, but a subtitution like
$string =~ s/INCLUDE/%INCLUDE;/
will make all of the replacement in one go and retuen the number of replacements made. So a loop
while ($string =~ s/INCLUDE/%INCLUDE;/g) { ... }
will endlessly add more and more percentage signs before and semicolons after every INCLUDE.
To find the number of replacements made, change all your loops like this to just
$count = $string =~ s/INCLUDE/%INCLUDE;/g
the pattern in s/INCLUDE/%INCLUDE/g will match the replacement also, so if you're running it in a while loop it will run forever (until you run out of memory).
s///g will replace all matches in a single shot so you very rarely will need to put it in a loop. Same goes for m//g, it will do the counting in a single step if you put it in list context.

In Perl, how can I remove all spaces that are not inside double quotes " "?

I'm tying to come up with some regex that will remove all space chars from a string as long as it's not inside of double quotes (").
Example string:
some string with "text in quotes"
Result:
somestringwith"text in quotes"
So far I've come up with something like this:
$str =~ /"[^"]+"|/g;
But it doesn't seem to be giving the intended result.
I'm honestly very new at perl and haven't had too much regexp experience. So if anyone willing to answer would also be willing to provide some insight into the why and how that would be great!
Thanks!
EDIT
String will not contain escaped "'s
It should actually always be formatted like this:
Some.String = "Some Value"
Result would be
Some.String="Some Value"
Here is a technique using split to separate the quoted strings. It relies on your data being consistent and will not work with loose quotes.
use strict;
use warnings;
my #line = split /("[^"]*")/;
for (#line) {
unless (/^"/) {
s/[ \t]+//g;
}
}
print #line; # line is altered
Basically, you split up the string in order to isolate the quoted strings. Once that is done, perform the substitution on all other strings. Since the array elements are aliased in the loop, substitutions are performed on the actual array.
You can run this script like so:
perl -n script.pl inputfile
To see the output. Or
perl -n -i.bak script.pl inputfile
To do in-place edit on inputfile, while saving backup in inputfile.bak.
With that said, I'm not sure what your edit means. Do you want to change
Some.String = "Some Value"
to
Some.String="Some Value"
Text::ParseWords is tailor-made for this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
my #strings = (
q{This.string = "Hello World"},
q{That " string " and "another shoutout to my bytes"},
);
for my $s ( #strings ) {
my #words = quotewords '\s+', 1, $s;
print join('', #words), "\n";
}
Output:
This.string="Hello World"
That" string "and"another shoutout to my bytes"
Using Text::ParseWords means if you ever had to deal with quoted strings with escaped quotation marks in them, you'd be ready ;-)
Also, this sounds like you have a configuration file of some sort and you're trying to parse it. If that is the case, there are probably better solutions.
I suggest removing the quoted substrings using split and then recombining them with join after removing whitespace from the intermediate text.
Note that if the regex used for split contains captures then the captured values will also be included in the list returned.
Here's some sample code.
use strict;
use warnings;
my $source = <<END;
Some.String = "Some Value";
Other.String = "Other Value";
Last.String = "Last Value";
END
print join '', map {s/\s+// unless /"/; $_; } split /("[^"]*")/, $source;
output
Some.String= "Some Value";Other.String = "Other Value";Last.String = "Last Value";
I would simply loop through the string char by char. This way you can handle escaped strings too (just add an isEscaped variable).
my $text='lala "some thing with quotes " lala ... ';
my $quoteOpen = 0;
my $out;
foreach $char(split//,$text) {
if ($char eq "\"" && $quoteOpen==0) {
$quoteOpen = 1;
$out .= $char;
} elsif ($char eq "\"" && $quoteOpen==1) {
$quoteOpen = 0;
$out .= $char;
} elsif ($char =~ /\s/ && $quoteOpen==1) {
$out .= $char;
} elsif ($char !~ /\s/) {
$out .= $char;
}
}
print "$out\n";
Splitting on double quotes, removing spaces only from even fields (i.e. those in quotes):
sub remove_spaces {
my $string = shift;
my #fields = split /"/, $string . ' '; # trailing space needed to keep final " in output
my $flag = 1;
return join '"', map { s/ +//g if $flag; $flag = ! $flag; $_} #fields;
}
It can be done with regex:
s/([^ ]*|\"[^\"]*\") */$1/g
Note that this won't handle any kind of escapes inside the quotes.

Perl Regular Expression Question

I wrote a Perl program which reads text from text file and prints it out.
I want to print out a line which has specific format.
For example, there are some lines like this:
information:
Ahmad.prn:592118:2001:7:5:/Essay
Ashford.rtf:903615:2001:6:28:/usr/Essay
Barger.doc:243200:2001:7:4:/home/dir
end of Information.
I want to read only these three lines:
Ahmad.prn:592118:2001:7:5:/Essay
Ashford.rtf:903615:2001:6:28:/usr/Essay
Barger.doc:243200:2001:7:4:/home/dir
I think that the meaning of the fields is:
Ahmad.prn <- file name
592118 <- size of file
2001:7:5 <- created date
/Essay <- path of file
My code is this:
#!/usr/bin/perl
use strict;
use warnings;
open (my $infh, "<", $file)||die "cant open";
while(my $line = <$infh>) {
chomp ($line);
if ($line =~ /(what regular expression do I have to put in here?)/) {
print "$line";
}
}
close ($infh);
If lines you need always ends with /Essay, you may use following regex
/:\/Essay$/
Edit 1: looks there is middle parts are only numbers, you may match this way.
/:\d+:\d+:\d+:\d+:/
Since you have this format for Ahmad.prn:592118:2001:7:5:/Essay
Ahmad.prn <- file name
592118 <- size of file
2001:7:5 <- created date
/Essay <- path of file
you can use this regular expression
/^\s*(\S+):(\d+):(\d+:\d+:\d+):(\S+)\s*$/
With this you will have file name in $1, Size of the file in $2, Date of creation in $3, Path to the file in $4
I added optional spaces in the start and end of the line, if you want to allow optional spaces after or before : you can add \s*
#!/usr/bin/perl
use strict;
my $inputText = qq{
Ahmad.prn:592118:2001:7:5:/Essay
Ashford.rtf:903615:2001:6:28:/usr/Essay
Barger.doc:243200:2001:7:4:/home/dir
end of Information.
};
my #input = split /\n/, $inputText;
my $i = 0;
while ($input[$i] !~ /^end of Information.$/) {
if ($input[$i] !~ /:/) {
$i++;
next;
}
my ($fileName, $fileSize, $year, $month, $day, $filePath) = split /:/, $input[$i];
print "$fileName\t $fileSize\t $month/$day/$year\t $filePath\n";
$i++;
}
$line =~ ([a-zA-Z.]+):(\d+):(\d+):(\d+):(\d+):([\/A-Za-z]+)
$name = $1; #Ahmad.prn
$id = $2; #592118
$year = $3; #2001
$dir = $6; #/Essay
Note: loop through it for multiple names