Searching a file name using regex and glob variable - regex

I have a requirement to find a file and display a text if it is present; below is the code I am using:
use strict;
use warnings;
use feature "say";
my @arr={"file1*.gz","file2*.gz"};
foreach my $file (@arr) {
    my $file1 = glob("$file");
    say "$file1";
    if (-e $file1) {
        say "File Generated using script";
    }
}
When I run this code, I get the 1st element of the array properly, but for the 2nd element I see the following error:
Use of uninitialized value $file1 in string
And if the size of the array is 1, then it is working properly.
I am not sure what's going wrong in the above code.

There is a problem with this line:
my @arr={"file1*.gz","file2*.gz"};
I think you meant to use parentheses instead of curlies to create your array; what you have is a hash reference.
There is also a problem with this line:
my $file1=glob ("$file");
Instead of returning a file name, glob returns undef the second time because you are using it in scalar context and:
In scalar context, glob iterates through such filename expansions,
returning undef when the list is exhausted.
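A small sketch of this behavior (the temp files and names are invented for the demonstration): a scalar-context `glob` at a given call site keeps iterating the list started by its *first* argument, and only signals exhaustion by returning `undef`; a new pattern passed on a later call at that same call site is not consulted until then.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create two matching files in a throwaway directory.
my $dir = tempdir( CLEANUP => 1 );
for my $name ("file1-a.gz", "file2-a.gz") {
    open my $fh, '>', "$dir/$name" or die $!;
    close $fh;
}

our @results;
for my $pat ( "$dir/file1*.gz", "$dir/file2*.gz" ) {
    push @results, scalar glob($pat);   # one result per call
}
# First call: the only match of file1*.gz.
# Second call: undef -- the first list is exhausted, and the new
# pattern "file2*.gz" is never even looked at.
```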
You can use glob in list context, which can be enforced with parentheses around $file:
use strict;
use warnings;
use feature "say";
my @arr = ("file1*.gz", "file2*.gz");
foreach my $file (@arr) {
    my ($file1) = glob($file);
    say $file1;
    if ( -e $file1 ) {
        say "File Generated using script";
    }
}
This code grabs only the 1st file name that matches your wildcard. If you want all files that match, you need to add another for loop.

You are not using all the power of the glob function. As toolic says, you are also using curly braces {} wrong -- they create a hash reference, not a list.
Your options in normal array assignment are typically:
my @arr = ('foo', 'bar');   # parentheses
my @arr = qw(foo bar);      # "quote words": a string split on whitespace
But this is not relevant to the answer to your question: How to find the names of files that exist that match the glob expression.
First off, the argument to glob can contain several patterns, separated by space. For example, from the documentation perldoc -f glob:
...glob("*.c *.h") matches all files with a .c or .h extension.
You should read the entire documentation, it is very enlightening, and core Perl knowledge.
So you do not need to loop around your number of glob patterns, just concatenate them. E.g.
my $globs = "file1*.gz file2*.gz"; # two patterns at once
But there is more. You can use curly braces in globs, creating an alternation. For example, {1,2} will create two alternations with 1 and 2 respectively, so we can simplify your expression further
my $globs = "file{1,2}*.gz"; # still two patterns at once
And there is more. You do not need to store these glob patterns in an array and loop over it, you can just loop over the glob result itself. E.g.
for my $file (glob $globs) { # loop over globs
You also do not need to check if a file exists with -e, as the glob already takes care of that check. If the glob does not return a file name, it was not found. It works much the same as using globs on the command line in bash.
So in short, you can use something like this:
use strict;
use warnings;
use feature "say";
foreach my $file (glob "file{1,2}*.gz") {
    say "File '$file' found";
}

Perhaps the OP intended something along the following lines:
use strict;
use warnings;
use feature 'say';
my @patterns = qw/file1*.gz file2*.gz/;
for my $pat (@patterns) {
    say 'Pattern: ' . $pat;
    for my $fname ( glob($pat) ) {
        say "\t$fname :: File Generated using script" if -e $fname;
    }
}


How to get regex to work in a perl script?

I am working on a Linux based Debian environment (precisely a Proxmox server) and I am writing a perl script.
My problem is : I have a folder with some files in it, every files in this folder have a number as a name (example : 100, 501, 102...). The lowest number possible is 100 and there is no limit for the greatest.
I want my script to only return files whose name is between 100 and 500.
So, I write this :
system(ls /the/path/to/my/files | grep -E "^[1-4][0-9]{2}|5[0]{2}");
I think my regex and the command are good because when I type this into a terminal, this is working.
But soon as I execute my script, I have those errors messages :
String found where operator expected at backupsrvproxmox.pl line 3, near "E "^[1-4][0-9]{2}|5[0]{2}""
(Do you need to predeclare E?)
Unknown regexp modifier "/b" at backupsrvproxmox.pl line 3, at end of line
syntax error at backupsrvproxmox.pl line 3, near "E "^[1-4][0-9]{2}|5[0]{2}""
Execution of backupsrvproxmox.pl aborted due to compilation errors.
I also tried with egrep but still not working.
I don't understand why the error message is about the /b modifier, since I only use integers and no strings.
So, any help would be good !
Instead of shelling out to system tools via system, you can very nicely do it all in your program:
my @files = grep {
    my ($n) = m{.*/([0-9]+)};  #/
    defined $n and $n >= 100 and $n <= 500;
}
glob "/the/path/to/my/files/*";
This assumes that numbers in file names are at the beginning of the filename, picked up from the question, so the subpattern for the filename itself directly follows a /. †
  (That "comment" #/ on the right is there merely to turn off wrong and confusing syntax highlighting in the editor.)
The command you tried didn't work because of wrong syntax: system takes either a string or a list of strings, while you gave it a bunch of "bareword"s, which confused the interpreter into emitting a convoluted error message (most of the time Perl's error messages are right to the point).
But there is no need to suffer through syntax details, which can get rather involved for this, nor with shell invocations which are complex and messy (under the hood), and inefficient.
† It also assumes that the files are not in the current directory -- clearly, since a path is passed to glob (and not just * for files in the current directory), which returns the filename with the path, and which is why we need the .*/ to greedily get to the last / before matching the filename.
But if we are in the current directory that won't work, since there would be no / in the filename. To include this possibility the regex needs to be modified, for example like
my ($n) = m{ (?: .*/ | ^) ([0-9]+)}x;
This matches filenames beginning with a number, either after the last slash in the path (with .*/ subpattern) or at the beginning of the string (with ^ anchor).
The modifier /x makes it discard literal spaces in the pattern so we can use them freely (along with newlines and # comments!) to make the pattern more readable. I also use {} as delimiters so as not to have to escape the / in the pattern (and with any delimiter other than // we must keep the leading m).
Using a regular expression to try to match a range of numbers is just a pain. And this is perl; no need to shell out to external programs to get a list of files (Generally also a bad idea in shell scripts; see Why you shouldn't parse the output of ls(1))!
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;

sub getfiles {
    my $directory = shift;
    opendir my $dir, $directory or die "Unable to open $directory: $!";
    my @files =
        grep { /^\d+$/ && $_ >= 100 && $_ <= 500 } readdir $dir;
    closedir $dir;
    return @files;
}

my @files = getfiles '/the/path/to/my/files/';
say "@files";
Or using the useful Path::Tiny module:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Path::Tiny;

# Returns a list of Path::Tiny objects, not just names.
sub getfiles {
    my $dir = path($_[0]);
    return grep { $_ >= 100 && $_ <= 500 } $dir->children(qr/^\d+$/);
}

my @files = getfiles '/the/path/to/my/files/';
say "@files";

perl search and replace a substring

I am trying to search for a substring and replace the whole string if the substring is found. In the below example, someVal could be any value that is unknown to me.
How can I search for someServer.com and replace the whole string $oldUrl with $newUrl?
I can do it on the whole string just fine:
$directory = "/var/tftpboot";
my $oldUrl = "someVal.someServer.com";
my $newUrl = "someNewVal.someNewServer.com";
opendir( DIR, $directory ) or die $!;
while ( my $files = readdir(DIR) ) {
    next unless ( $files =~ m/\.cfg$/ );
    open my $in,  "<", "$directory/$files";
    open my $out, ">", "$directory/temp.txt";
    while (<$in>) {
        s/.*$oldUrl.*/$newUrl/;
        print $out $_;
    }
    rename "$directory/temp.txt", "$directory/$files";
}
Your script will delete much of your content because you are surrounding the match with .*. This will match any character except newline, as many times as it can, from start to end of each line, and replace it.
The functionality you are after already exists in Perl via the -pi command-line switches, so it is a good idea to use them rather than rolling your own version that works exactly the same way. You do not need a one-liner to use the in-place edit. You can do this:
perl -pi script.pl *.cfg
The script should contain the name definitions and substitutions, and any error checking you need.
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/\Q$old\E/$new/g;
This is the simplest possible solution, when running with the -pi switches, as I showed above. The \Q ... \E is the quotemeta escape, which escapes meta characters in your string (highly recommended).
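A minimal sketch of why \Q...\E matters (the sample strings are invented): without it, the dots in the host name are regex metacharacters and match *any* character.

```perl
use strict;
use warnings;

my $old  = "someVal.someServer.com";
my $text = "someValXsomeServerXcom";    # dots replaced by X

# Without quotemeta, '.' happily matches the X characters.
our $without_quotemeta = ($text =~ /$old/)     ? 1 : 0;
# With \Q...\E the dots are escaped and must match literal dots.
our $with_quotemeta    = ($text =~ /\Q$old\E/) ? 1 : 0;
```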
You might want to prevent partial matches. If you are matching foo.bar, you may not want to match foo.bar.baz, or snafoo.bar. To prevent partial matching, you can put in anchors of different kinds.
(?<!\S) -- do not allow any non-whitespace before match
\b -- match word boundary
Word boundary would be suitable if you want to replace server1.foo.bar in the above example, but not snafoo.bar. Otherwise use whitespace boundary. The reason we do a double negation with a negative lookaround assertion and negated character class is to allow beginning and end of line matches.
So, to sum up, I would do:
use strict;
use warnings;
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/(?<!\S)\Q$old\E(?!\S)/$new/g;
And run it with
perl -pi script.pl *.cfg
If you want to try it out beforehand (highly recommended!), just remove the -i switch, which will make the script print to standard output (your terminal) instead. You can then run a diff on the files to inspect the difference. E.g.:
$ perl -p script.pl test.cfg > test_replaced.cfg
$ diff test.cfg test_replaced.cfg
You will have to decide whether word boundary is more desirable, in which case you replace the lookaround assertions with \b.
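The difference between the two anchoring styles can be sketched like this (sample strings invented, matching the foo.bar example above):

```perl
use strict;
use warnings;

my $host = qr/foo\.bar/;
my @samples = qw(foo.bar server1.foo.bar snafoo.bar);

# \b: a word boundary; a '.' before the match counts as a boundary.
our %word  = map { $_ => (/\b$host\b/           ? 1 : 0) } @samples;
# (?<!\S)/(?!\S): only whitespace (or string start/end) may surround the match.
our %space = map { $_ => (/(?<!\S)$host(?!\S)/  ? 1 : 0) } @samples;
```

So `\b` still accepts server1.foo.bar (boundary after the dot) while the whitespace anchors reject it; both reject snafoo.bar.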
Always use
use strict;
use warnings;
Even in small scripts like this. It will save you time and headaches.
If you want to match and replace any subdomain, then you should devise a specific regular expression to match them.
\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com
The following is a rewrite of your script using more Modern Perl techniques, including Path::Class to handle file and directory operations in a cross platform way and $INPLACE_EDIT to automatically handle the editing of a file.
use strict;
use warnings;
use autodie;
use Path::Class;

my $dir = dir("/var/tftpboot");
while (my $file = $dir->next) {
    next unless $file =~ m/\.cfg$/;
    local @ARGV = "$file";
    local $^I = '.bak';
    while (<>) {
        s/\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com\b/someNewVal.someNewServer.com/;
        print;
    }
    #unlink "$file$^I"; # Optionally delete backup
}
Watch for the Dot-Star: it matches everything that surrounds the old URL, so the only thing remaining on the line will be the new URL:
s/.*$oldUrl.*/$newUrl/;
Better:
s/$oldUrl/$newUrl/;
Also, you might need to close the output file before you try to rename it.
If the old URL contains special characters (dots, asterisks, dollar signs...) you might need to use \Q$oldUrl to suppress their special meaning in the regex pattern.

Perl: how to supply regexp list from file

So, I need to parse a file, and if something matches a pattern, replace it with something:
while (<$ifh>) {
    s/(.*pattern_1*)/$1\nsome more stuff/;
    s/(.*pattern_2*)/$1\neven more stuff/;
    s/(.*pattern_3.*)//;
    # and so on ...
    print $ofh $_;
}
Question: what would be the simplest way to keep this list of regexp rules in a file (something similar to sed's '-f' option)?
EDIT: perhaps I need to clarify a bit. We need to have the regexp rules in a separate file (not in the parsed file - although this was nice, thanks!), so they are not hardcoded. So, basically the external file should consist of 's//' lines.
Of course this can probably be done with a foreach loop and eval, or even with an external call to sed, but I suppose there can be something nicer.
regards, Wojtek
You could create patterns and store them in a Perl list, then simply iterate through your patterns in your input loop.
my @patterns = ( qr/pattern1/, qr/pattern2/, qr/pattern3/ );   # etc.
while (<$ifh>) {
    for my $pattern (@patterns) {
        s/($pattern)/$1\neven more stuff/;
    }
    print $ofh $_;
}
Based on your edit, here's a way to store regular expressions in an external file. Note that I would not recommend storing whole s/.../.../ statements in an external file, but you could use the eval function to accomplish it. Here's how I would solve this, however:
my (@regexes, $i);
$i = 0;
while (my $line = <$rifh>) {
    chomp($line);
    if (index($line, "::") > -1) {
        my @texts = split(/::/, $line);
        $regexes[$i]{"to find"}      = qr!$texts[0]!;
        $regexes[$i]{"replace with"} = $texts[1];
        $i++;
    }
}
while (my $line = <$ifh>) {
    chomp($line);
    foreach my $regex (@regexes) {
        $line =~ s!$regex->{"to find"}!$regex->{"replace with"}!;
    }
    print $ofh $line . "\n";
}
(My use of ! as a regexp delimiter is my own style mechanism, often useful for allowing bare / characters to be used directly in a regexp definition.)

Efficiently matching a set of filenames with regex in Perl

I'm using Perl to capture the names of files in some specified folders that have certain words in them. The keywords in those filenames are "offers" or "cleared" and "regup" or "regdn". In other words, one of "offers" or "cleared" AND one of "regup" or "regdn" must appear in the filename to be a positive match. The two words could be in any order and there are characters/words that will appear in front of and behind them. A sample matching filename is:
2day_Agg_AS_Offers_REGDN-09-JUN-11.csv
I have a regex that successfully captures each of the matching filenames as a full path, which is what I wanted, but it seems inelegant and inefficient. Attempts at slightly better code have all failed.
Working approach:
# Get the folder names
my @folders = grep /^\d{2}-/, readdir DIR;

foreach my $folder ( @folders ) {
    # glob the contents of the folder (to get the file names)
    my @contents = <$folder/*>;
    # For each filename in the list, if it matches, print it
    foreach my $item ( @contents ) {
        if ($item =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/i) {
            print "$item\n";
        }
    }
}
Attempt at something shorter/cleaner:
foreach my $folder ( @folders ) {
    # glob the contents of the folder (to get the file names)
    my @contents = <$folder/*>;
    # Seems to determine that there are four matches in each folder
    # but then prints the first matching filename four times
    my $single = join("\n", @contents);
    for ($single =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/im) {
        print "$&\n"; # "Matched: |$`<$&>$'|\n\n";
    }
}
I've tried other formatting with the regex, using other options (/img, /ig, etc.), and sending the output of the regex to an array, but nothing has worked properly. I'm not great with Perl, so I'm positive I'm missing some big opportunities to make this whole procedure more efficient. Thanks!
Collect only those file names which contain offers or cleared AND regup or regdn:
my @contents = grep { /offers|cleared/i && /regup|regdn/i } <$folder/*>;
Why would it be shorter or cleaner to use join instead of a loop? I'd say it makes it more complicated. What you seem to be doing is just matching loosely based on the conditions
name contains offers or cleared
name contains regup or regdn
name ends with .csv.
So why not just do this:
if ( $file =~ /offers|cleared/i and
     $file =~ /regup|regdn/i and
     $file =~ /csv$/i )
You might be interested in something like this:
use strict;
use warnings;
use File::Find;
my $dir = "/some/dir";
my @files;
find(sub { /offers|cleared/i &&
           /regup|regdn/i &&
           /csv$/i && push @files, $File::Find::name }, $dir);
This completely avoids the explicit readdir and nested loops. File::Find is recursive.

Is it possible to use a Perl foreach loop to process different files with R?

I am trying to process different files as input to an R script; for this I use a foreach loop in Perl, but R sends me an error:
Problem while running this R command:
a <- read.table(file="~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/$newquery")
Error:
file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/Users/cristianarleyvelandiahuerto/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/$newquery': No such file or directory
Execution halted
My code is:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::R;
use Data::Dumper;

my $R = Statistics::R->new();
my @query = (
    '~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce.RF00001.txt',
    '~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_60.RF00001.txt',
    '~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_70.RF00001.txt',
    '~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_80.RF00001.txt',
    '~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/dvex_all_rRNA_ce_90.RF00001.txt'
);
foreach my $query (@query) {
    my $newquery = $query;
    $newquery =~ s/(.*)\/(dvex_all.*)(\.txt)/$2$3/g;
    print "$newquery\n";
    $R->run(q`a <- read.table(file="~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/$newquery")`);
    $R->run(q`res <- summary(a$V2)`);
    my $output_value = $R->get('res');
    print "Statistical Summary = $output_value\n";
}
With the regex I changed the name of the input, but R doesn't recognize this as a file. Can I do that? Any suggestions? Thanks!
You have:
$R->run(q`...`);
i.e., you're using the q operator. String interpolation is not done with q. The immediate solution is to use
$R->run(qq`...`);
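The difference can be sketched in a couple of lines (the value of $newquery is invented for the example): q{} behaves like single quotes (no interpolation), qq{} like double quotes.

```perl
use strict;
use warnings;

my $newquery = "dvex_all_rRNA_ce.RF00001.txt";

our $with_q  = q{file="$newquery"};    # stays literal: $newquery not expanded
our $with_qq = qq{file="$newquery"};   # variable interpolated
```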
You used Perl's quote operator q(), which has the same semantics as a single quoted string — that is, no variable interpolation. If you don't want to use a double quoted string (or the equivalent qq() operator), then you have to concatenate the variable into your query:
use feature 'say';

my $workdir = "~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/";
for my $query (@query) {
    (my $newquery = $query) =~ s{\A.*/(?=dvex_all.*[.]txt\z)}{};
    say $newquery;
    # actually, here you should escape any <"> double quotes in $workdir and $newquery.
    $R->run(q#a <- read.table(file="# . $workdir . $newquery . q#")#);
    $R->run(q#res <- summary(a$V2)#);
    my $output_value = $R->get('res');
    say "Statistical Summary = $output_value";
}
Other improvements I made:
proper indentation is the first step to correct code
The say function is more comfortable than print.
The substitution has a better delimiter, and now clearly shows what it does: deleting the path from the filename. Actually, we should be using one of the cross-platform modules that do this.
I used the substitute-in-copy idiom (my $copy = $orig) =~ s///. Since Perl 5.14, you can use the /r flag instead: my $copy = $orig =~ s///r.
The /g flag for the regex is useless.
I anchored the match at the start and end of the string.
The q`` strings now have a more visible delimiter
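The two copy idioms from the list above can be sketched side by side (the path is invented); both strip the directory part and leave the original untouched:

```perl
use strict;
use warnings;

our $orig = "path/to/dvex_all_sample.txt";

# Classic substitute-in-copy: assign first, then modify the copy in place.
(our $copy1 = $orig) =~ s{\A.*/}{};

# /r (Perl 5.14+): returns the modified result, leaving $orig alone.
our $copy2 = $orig =~ s{\A.*/}{}r;
```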
I don't know R, but it looks like you should concatenate your string:
$R->run(q`a <- read.table(file="~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/` . $newquery . q`")`);
R will not concatenate a string using a "$" operator. That "$query" is inside an R function argument so you cannot rely on Perl-operators. (Caveat: I'm not a Perl-user, so I'm making an assumption that your code is creating an R object within the loop named 'query') If I'm right then you may need to use paste0():
The file argument might look like:
file=paste0("~/Desktop/ncRNA/Data/Inputs/Boxplot_all/All/", query)