Efficiently matching a set of filenames with regex in Perl

I'm using Perl to capture the names of files in some specified folders that have certain words in them. The keywords in those filenames are "offers" or "cleared" and "regup" or "regdn". In other words, one of "offers" or "cleared" AND one of "regup" or "regdn" must appear in the filename to be a positive match. The two words could be in any order and there are characters/words that will appear in front of and behind them. A sample matching filename is:
2day_Agg_AS_Offers_REGDN-09-JUN-11.csv
I have a regex that successfully captures each of the matching filenames as a full path, which is what I wanted, but it seems inelegant and inefficient. Attempts at slightly better code have all failed.
Working approach:
# Get the folder names
my @folders = grep /^\d{2}-/, readdir DIR;
foreach my $folder ( @folders ) {
    # glob the contents of the folder (to get the file names)
    my @contents = <$folder/*>;
    # For each filename in the list, if it matches, print it
    foreach my $item ( @contents ) {
        if ($item =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/i){
            print "$item\n";
        }
    }
}
Attempt at something shorter/cleaner:
foreach my $folder ( @folders ) {
    # glob the contents of the folder (to get the file names)
    my @contents = <$folder/*>;
    # Seems to determine that there are four matches in each folder
    # but then prints the first matching filename four times
    my $single = join("\n", @contents);
    for ($single =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/im) {
        print "$&\n";#"Matched: |$`<$&>$'|\n\n";
    }
}
I've tried other formatting with the regex, using other options (/img, /ig, etc.), and sending the output of the regex to an array, but nothing has worked properly. I'm not great with Perl, so I'm positive I'm missing some big opportunities to make this whole procedure more efficient. Thanks!

Collect only those file names which contain offers or cleared AND regup or regdn:
my @contents = grep { /offers|cleared/i && /regup|regdn/i } <$folder/*>;
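As a rough, self-contained sketch of that filter, the snippet below runs the same two tests over an in-memory list instead of a glob. Only the first sample name comes from the question; the others are made up for illustration, and the extra `.csv` check is an assumption carried over from the question's own regex.

```perl
use strict;
use warnings;

# Hypothetical sample names; only the first one appears in the question.
my @names = (
    '2day_Agg_AS_Offers_REGDN-09-JUN-11.csv',
    '2day_Agg_AS_Cleared_REGUP-09-JUN-11.csv',
    '2day_Agg_AS_Offers_RRS-09-JUN-11.csv',
    'REGUP_cleared_notes.txt',
);

# Keep names containing one word from each group and ending in .csv.
my @matches = grep {
    /offers|cleared/i && /regup|regdn/i && /\.csv$/i
} @names;

print "$_\n" for @matches;
```

Splitting the condition into separate small matches keeps each regex trivial and avoids the lookaheads entirely.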

Why would it be shorter or cleaner to use join instead of a loop? I'd say it makes it more complicated. What you seem to be doing is just matching loosely based on the conditions
name contains offers or cleared
name contains regup or regdn
name ends with .csv.
So why not just do this:
if ( $file =~ /offers|cleared/i and
$file =~ /regup|regdn/i and
$file =~ /csv$/i)
You might be interested in something like this:
use strict;
use warnings;
use File::Find;
my $dir = "/some/dir";
my @files;
find(sub { /offers|cleared/i &&
           /regup|regdn/i &&
           /csv$/i && push @files, $File::Find::name }, $dir);
Which would completely exclude the use of readdir and other loops. File::Find is recursive.

Related

Searching a file name using regex and glob variable

I have a requirement to find a file name display a text if it is present and below is the code I am using
use strict;
use warnings;
use feature "say";
my @arr={"file1*.gz","file2*.gz"};
foreach my $file (@arr) {
    my $file1=glob ("$file");
    say "$file1";
    if (-e $file1) {
        say "File Generated using script";
    }
}
When I use the below code, I am able to get 1st element of array properly, but for the 2nd element, I am seeing below error:
Use of uninitialized value $file1 in string
And if the size of the array is 1, then it is working properly.
I am not sure what's going wrong in the above code.

There is a problem with this line:
my @arr={"file1*.gz","file2*.gz"};
I think you meant to use parentheses instead of curlies to create your array; what you have is a hash reference.
There is also a problem with this line:
my $file1=glob ("$file");
Instead of returning a file name, glob returns undef the second time because you are using it in scalar context and:
In scalar context, glob iterates through such filename expansions,
returning undef when the list is exhausted.
You can use glob in list context, which can be enforced with parentheses around $file:
use strict;
use warnings;
use feature "say";
my @arr = ("file1*.gz", "file2*.gz");
foreach my $file (@arr) {
    my ($file1) = glob($file);
    say $file1;
    if ( -e $file1 ) {
        say "File Generated using script";
    }
}
This code grabs only the 1st file name that matches your wildcard. If you want all files that match, you need to add another for loop.

You are not using all the power of the glob function. As toolic says, you are also using curly braces {} wrong -- they create a hash reference, not a list.
Your options in normal array assignment are typically:
my @arr = ('foo', 'bar'); # parentheses
my @arr = qw(foo bar); # "quote word", basically a string split on space
But this is not relevant to the answer to your question: How to find the names of files that exist that match the glob expression.
First off, the argument to glob can contain several patterns, separated by space. For example, from the documentation perldoc -f glob:
...glob("*.c *.h") matches all files with a .c or .h extension.
You should read the entire documentation, it is very enlightening, and core Perl knowledge.
So you do not need to loop around your number of glob patterns, just concatenate them. E.g.
my $globs = "file1*.gz file2*.gz"; # two patterns at once
But there is more. You can use curly braces in globs, creating an alternation. For example, {1,2} will create two alternations with 1 and 2 respectively, so we can simplify your expression further
my $globs = "file{1,2}*.gz"; # still two patterns at once
And there is more. You do not need to store these glob patterns in an array and loop over it, you can just loop over the glob result itself. E.g.
for my $file (glob $globs) { # loop over globs
You also do not need to check if a file exists with -e, as the glob already takes care of that check. If the glob does not return a file name, it was not found. It works much the same as using globs on the command line in bash.
So in short, you can use something like this:
use strict;
use warnings;
use feature "say";
foreach my $file (glob "file{1,2}*.gz") {
    say "File '$file' found";
}
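To try the brace glob without touching real data, here is a minimal sketch that builds a temporary directory first. The file names are hypothetical, and the explicit sort is only there to make the output order predictable.

```perl
use strict;
use warnings;
use feature 'say';
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Build a throwaway directory holding a few hypothetical files.
my $start = getcwd();
my $dir   = tempdir( CLEANUP => 1 );
chdir $dir or die "Cannot chdir to $dir: $!";
for my $name (qw(file1_a.gz file2_b.gz file3_c.gz notes.txt)) {
    open my $fh, '>', $name or die "Cannot create $name: $!";
    close $fh;
}

# One glob call covers both patterns; only existing files are returned.
my @found = glob 'file{1,2}*.gz';
@found = sort @found;
say "File '$_' found" for @found;

# Leave the temporary directory so it can be cleaned up at exit.
chdir $start or die "Cannot chdir back to $start: $!";
```

Here file3_c.gz and notes.txt are skipped because neither matches the pattern, so no separate -e test is needed.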

Perhaps OP intended something of following kind
use strict;
use warnings;
use feature 'say';
my @patterns = qw/file1*.gz file2*.gz/;
for my $pat (@patterns) {
    say 'Pattern: ' . $pat;
    for my $fname ( glob($pat) ) {
        say "\t$fname :: File Generated using script" if -e $fname;
    }
}

Trying to match two variables that both contain special characters in Perl

So here is a weird problem. I have a ton of scripts that are executed by "master" scripts and I need to verify that what is in the "master" is valid. Problem is, these scripts contain special characters and I need to match them to make sure the "Master" is referencing the correct scripts.
An example of one file might be
Example's of file names (20160517) [test].sh
Here is what my code looks like. @MasterScipt is an array where each element is a filename of what I expect the sub-scripts to be named.
opendir( DURR, $FileLocation ); # I'm looking in a directory where the subscripts reside
foreach ( readdir(DURR) ) {
for ( my $j = 0; $j != $MasterScriptlength; $j++ ) {
$MasterScipt[$j] =~ s/\r//g;
print "DARE TO COMPARE\n";
print "$MasterScipt[$j]\n";
print "$_\n";
#I added the \Q to quotemeta, but I think the issue is with $_
#I've tried variations like
#if(quotemeta($_) =~/\Q$MasterScipt[$j]/){
#To no avail, I also tried using eq operator and no luck :(
if ( $_ =~ /\Q$MasterScipt[$j]/ ) {
print "WE GOOD VINCENT\n";
}
}
}
closedir(DURR);
No matter what I seem to do, my output will always look like this
DARE TO COMPARE
Example's of file names (20160517) [test].sh
Example's of file names (20160517) [test].sh

OK, I was staring at this thing for too long, and I think writing this question out helped me answer it.
Not only did I need to add \Q in my regex, but there was also a trailing whitespace character. I did a chomp on both $_ and $MasterScipt[$j] and now it's working.

I suggest that your code should look more like this. The main changes are that I have used a named variable $file for the values returned by readdir, and I iterate over the contents of the array @MasterScipt instead of its indexes, because $j is never used in your own code except to access the array elements.
s/\s+\z// for @MasterScipt;
opendir DURR, $FileLocation or die qq{Unable to open directory "$FileLocation": $!};
while ( my $file = readdir DURR ) {
    for my $pattern ( @MasterScipt ) {
        print "DARE TO COMPARE\n";
        print "$pattern\n";
        print "$file\n";
        if ( $file =~ /\Q$pattern/ ) {
            print "WE GOOD VINCENT\n";
        }
    }
}
closedir(DURR);
But this is a simple grep operation and it can be written as such. This alternative builds a single regular expression that will match any of the items in @MasterScipt and uses grep to build a list of all values returned by readdir that match it.
s/\s+\z// for @MasterScipt;
my @matches = do {
    my $re = join '|', map quotemeta, @MasterScipt;
    opendir my $dh, $FileLocation or die qq{Unable to open directory "$FileLocation": $!};
    grep { /$re/ } readdir $dh;
};
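For a quick check of the quotemeta idea without a real directory, the sketch below uses in-memory lists. The second master name and the directory entries are hypothetical, and anchoring with \A and \z is my own assumption that whole-name matches are wanted; drop the anchors to get the substring behaviour of the code above.

```perl
use strict;
use warnings;

# Hypothetical master list. The first name is the example from the question,
# with a trailing carriage return to mimic the stray whitespace.
my @master = (
    "Example's of file names (20160517) [test].sh\r",
    "Other script [v2].sh",
);
s/\s+\z// for @master;

# Escape every regex metacharacter, then join into one alternation.
my $re = join '|', map { quotemeta } @master;

# Hypothetical directory listing, as readdir might return it.
my @dir_entries = (
    '.', '..',
    "Example's of file names (20160517) [test].sh",
    'unrelated.sh',
);

# Assumption: match the whole name, not just a substring.
my @matches = grep { /\A(?:$re)\z/ } @dir_entries;

print "$_\n" for @matches;
```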

RegEx in perl that Uses Groups to Extract Information From A Filepath

So I need to take something in this format: 2015-08-15_15-41-32_44100_logo.txt and extract the date, time, and frequency from it, using these two pieces of code. Right now it's in the form <date>_<time>_<frequency>_logo.txt. Below is my attempt to make it a regex, but I know I'm missing something. How do I use groups in Perl to do this?
The code below searches through a directory for every filepath that follows the pattern, and returns those files in a list. What I need help with is the regex itself. I need to be able to get the frequency.
$pattern =qr/^(\d+)-(\d+)-(\d+)_(\d+)-(\d+)-(\d+)_44100_(\w+).(\w+)$/;
@listFiles = grep_files($bee_music_dir,$pattern);
print join(",",@listFiles);
sub grep_files {
    my ($dir, $pat) = @_;
    opendir(my $dir_handle, $dir) or die $!;
    my @files = grep { $_ =~ /$pat/ } readdir($dir_handle);
    closedir($dir_handle);
    return \@files;
}

Regular expression groups in perl are used like this:
my ($a, $b, $c) = $somestring=~ /(\d+)-(\d+)-(\d+)/;
Here, each variable in the list ($a, $b, $c) gets assigned the value of the matching groups, which are also available as $1, $2, and $3. So the above line is equivalent to:
$somestring =~ /(\d+)-(\d+)-(\d+)/;
my ($a, $b, $c) = ($1, $2, $3);
(you could even do my $a = $1; my $b = $2; my $c = $3).
If you want to declare a $pattern variable you should do it like this:
my $pattern = qr/(\d+)-(\d+)-(\d+)_(\d+)-(\d+)-(\d+)_(\d+)_(\w+).(\w+)/;
where qr is the quote-regexp operator, pre-compiling the regular expression for optimisation. You shouldn't use the =~ operator here, because it would apply the regular expression to $pattern rather than defining $pattern as that regular expression.
Defining a pattern this way allows you to just
$stringtomatch =~ $pattern;
(but =~ /$pattern/ will also work).
The regular expression to match files formatted like 2015-08-15_15-41-32_44100_logo.txt or <date>_<time>_<frequency>_logo.txt looks like this:
/^(\d\d\d\d)-(\d\d)-(\d\d)_(\d\d)-(\d\d)-(\d\d)_(\d+)_logo\.txt$/
You could use \d+ but it won't necessarily match a date. Also, . in regular expressions means 'any character', so if you really mean . you should escape it: \..
Here's a more verbose version of part of your sub illustrating access to the groups:
my @files = ();
while ( my $file = readdir($dir_handle) ) {
    if ( my ($year,$month,$day,$hour,$minute,$second,$freq) = $file =~ $pattern ) {
        # do something with $freq
        push @files, $file;
    }
}
If all you are after is a list of the frequencies, it would suffice to only 'group' the wanted field:
my $pattern = qr/^\d+-\d+-\d+_\d+-\d+-\d+_(\d+)_logo\.txt$/;
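Putting the pieces together, here is a small sketch that applies a fully grouped pattern to the sample name from the question. The %info hash and its keys are names I made up for illustration.

```perl
use strict;
use warnings;

# Strict-width groups for the date and time; the frequency is any run of digits.
my $pattern = qr/^(\d{4})-(\d\d)-(\d\d)_(\d\d)-(\d\d)-(\d\d)_(\d+)_logo\.txt$/;

my $name = '2015-08-15_15-41-32_44100_logo.txt';

# In list context the match returns the captured groups in order.
my %info;
if ( my ($year, $month, $day, $hour, $minute, $second, $freq) = $name =~ $pattern ) {
    %info = (
        date      => "$year-$month-$day",
        time      => "$hour:$minute:$second",
        frequency => $freq,
    );
}

printf "date=%s time=%s frequency=%s\n", @info{qw(date time frequency)};
# prints: date=2015-08-15 time=15:41:32 frequency=44100
```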

You were close, just a few changes. Here's the script and a test run:
$ cat freq.pl
#!/usr/bin/perl --
use strict;
use warnings;
my $pattern = qr/^(\d+)-(\d+)-(\d+)_(\d+)-(\d+)-(\d+)_(\d+)_(\w+).(\w+)$/;
sub grep_files {
    my ($dir, $pat) = @_;
    opendir(my $dir_handle, $dir) or die $!;
    my @files = grep { $_ =~ /$pat/ } readdir($dir_handle);
    s/$pat/$7/ foreach @files;
    closedir($dir_handle);
    \@files;
}
print join("\n", @{grep_files '.', $pattern}), "\n";
$ ls
2015-08-15_15-41-32_44100_logo.txt freq.pl
2015-08-25_25-41-32_48000_logo.txt
$ ./freq.pl
44100
48000
freq.pl extracts the frequency from the filenames in the current directory. It's based on yours, with some key differences:
You're matching the pattern against an undefined variable. You really want to store the pattern in the variable. I also anchor the pattern at the beginning and end, so that in the (admittedly unlikely) event that you have other files with extra text at the start or end, it won't match those by accident. You were also missing a semicolon at the end of the line.
You were selecting the files that match the pattern, but then not extracting the frequency. The s/$pat/$7/ foreach @files; loops over all the files matching the pattern and replaces everything with just the 7th group, which is the frequency. You could also select files and extract the frequency in one step by using map instead of grep.
I added the last line for testing.
While not directly related, always use use strict and use warnings at the top of your scripts. use strict makes some questionable constructs errors and use warnings warns about some possible problems with the script.
The ls shows the example files in the current directory, and ./freq.pl runs the script, showing its output.
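The map variant mentioned above might look something like this sketch. It works on an in-memory list that mirrors the ls output rather than on readdir, and it uses a single capture group, which is my simplification of the seven-group pattern.

```perl
use strict;
use warnings;

# Names mirroring the directory listing in the test run.
my @names = (
    '2015-08-15_15-41-32_44100_logo.txt',
    'freq.pl',
    '2015-08-25_25-41-32_48000_logo.txt',
);

# Only the frequency field is captured.
my $pattern = qr/^\d+-\d+-\d+_\d+-\d+-\d+_(\d+)_\w+\.\w+$/;

# map filters and extracts at once: a non-matching name contributes
# the empty list, a matching one contributes its captured frequency.
my @freqs = map { /$pattern/ ? $1 : () } @names;

print "$_\n" for @freqs;
```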

Perl: how to supply regexp list from file

So, I need to parse a file and, if something matches the pattern, replace it with something:
while(<$ifh>) {
s/(.*pattern_1*)/$1\nsome more stuff/ ;
s/(.*pattern_2*)/$1\neven more stuff/;
s/(.*pattern_3.*)// ;
# and so on ...
print $ofh $_;
}
Question: what would be the simplest way to keep this list of regexp rules in a file (something similar to the sed '-f' option)?
EDIT: perhaps I need to clarify a bit. We need to have the regexp rules in the separate file (not in the parsed file - although this was nice, thanks!), so they are not hardcoded. So, basically the external file should consist of 's//' lines.
Of course this can probably be done with foreach loop and eval, or even with external call to sed, but I suppose there can be something nicer.
regards, Wojtek

You could create patterns and store them in a Perl list, then simply iterate through your patterns in your input loop.
my @patterns = ( qr/pattern1/, qr/pattern2/, qr/pattern3/ ); # etc.
while(<$ifh>) {
    for my $pattern (@patterns) {
        s/($pattern)/$1\neven more stuff/;
    }
    print $ofh $_;
}

Based on your edit, here's a way to store regular expressions in an external file. Note that I would not recommend storing whole s/.../.../ statements in an external file, but you could use the eval function to accomplish it. Here's how I would solve this, however:
my (@regexes,$i);
$i=0;
# Read the rules file: one "pattern::replacement" pair per line
while (my $line=<$rifh>) {
    chomp($line);
    if (index($line, "::")>-1) {
        my @texts=split(/::/,$line);
        $regexes[$i]{"to find"} = qr!$texts[0]!;
        $regexes[$i]{"replace with"} = $texts[1];
        $i++;
    }
}
# Apply every rule to every input line
while (my $line=<$ifh>) {
    chomp($line);
    foreach my $regex (@regexes) {
        $line =~ s!$regex->{"to find"}!$regex->{"replace with"}!;
    }
    print $ofh $line . "\n";
}
(My use of ! as a regexp delimiter is my own style mechanism, often useful for allowing bare / characters to be used directly in a regexp definition.)
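To see the find::replace idea in action without any files, here is a minimal sketch that takes both the rules and the input from in-memory lists. The rule lines and sample text are invented for illustration. Note that with this approach the replacement is inserted as plain text, so a `$1` written in the rules file would not be expanded.

```perl
use strict;
use warnings;

# Hypothetical rule lines, as they might appear in the external rules file.
my @rule_lines = ( 'colou?r::color', 'foo+::bar', 'line without separator' );

# Parse each "pattern::replacement" line into a compiled rule.
my @rules;
for my $line (@rule_lines) {
    my ($find, $replace) = split /::/, $line, 2;
    next unless defined $replace;    # skip lines without "::"
    push @rules, { find => qr/$find/, replace => $replace };
}

# Hypothetical input lines standing in for the parsed file.
my @input = ( 'the colour of fooo', 'nothing here' );

# Apply every rule to a copy of every line.
my @output;
for my $line (@input) {
    my $copy = $line;
    for my $rule (@rules) {
        $copy =~ s/$rule->{find}/$rule->{replace}/;
    }
    push @output, $copy;
}

print "$_\n" for @output;
```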

Perl regexp how to get the file name out?

I have this directory path:
\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1
How can I get the file name testQEM.txt from the above string?
I use this:
$file =~ /(.+\\)(.+\..+)(\\.+)/;
But get this result:
file = testQEM.txt\main\ABC_QEM
Thanks,
Jirong

I'm not sure I understand, as paths cannot have a file node half way through them! Have multiple paths got concatenated somehow?
Anyway, I suggest you work through the path looking for the first node that validates as a real file using -f
Here is an example
use strict;
use warnings;
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my @path = split /\\/, $path;
my $file = shift @path;
$file .= '\\'.shift @path until -f $file or @path == 0;
print "$file\n";

/[^\\]+\.[^\\]+/
Capture anything separated by a . between two backslashes. Is this what you were looking for?

This is a bit difficult, as directory names can contain periods. This is especially true for *nix systems, but is valid under Windows as well.
Therefore, each possible subpath has to be tested iteratively for file-ness.
I'd maybe try something like this:
my $file;
my $weirdPath = q(/main/ABC_PRD/ABC_QEM/1/testQEM.txt/main/ABC_QEM/1);
my @parts = split m{/}, $weirdPath;
for my $i (0 .. $#parts) {
    my $path = join "/", @parts[0 .. $i];
    if (-f $path) { # optionally "not -d $path"
        $file = $parts[$i];
        last;
    }
}
print "file=$file\n"; # "file=testQEM.txt\n"
I split the weird path at all slashes (change to backslashes if interoperability is not an issue for you). Then I join the first $i+1 elements together and test if the path is a normal file. If so, I store the last part of the path and exit the loop.
If you can guarantee that the file is the only part of the path that contains periods, then using one of the other solutions will be preferable.

my $file = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my ($result) = $file =~ /\\([^\\]+\.[^\\]+)\\/;
Parentheses around $result force the list context on the right hand side expression, which in turn returns what matches in parentheses.

Use regex pattern /(?=[^\\]+\.)([^\\]+)/
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
print $1 if $path =~ /(?=[^\\]+\.)([^\\]+)/;

>echo "\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1"|perl -pi -e "s/.*([\\][a-zA-Z]*\.txt).*/\1/"
\testQEM.txt

I suggest you read up on how regexp backtracking works, in particular how the greedy quantifiers * and + behave.
You only need to make one small change to your regexp, making the quantifier after the dot non-greedy:
/(.+\\)(.+\..+?)(\\.+)/
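A small sketch comparing the two patterns on the path from the question shows the difference. The variable names are mine.

```perl
use strict;
use warnings;

my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';

# Original pattern: the greedy .+ after the dot runs on to the last backslash.
my (undef, $greedy) = $path =~ /(.+\\)(.+\..+)(\\.+)/;

# Modified pattern: .+? stops at the first backslash after the extension.
my (undef, $lazy) = $path =~ /(.+\\)(.+\..+?)(\\.+)/;

print "greedy: $greedy\n";    # greedy: testQEM.txt\main\ABC_QEM
print "lazy:   $lazy\n";      # lazy:   testQEM.txt
```

This still relies on the file name being the only path component that contains a dot.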
