How to get regex to work in a perl script?

I am working in a Debian-based Linux environment (a Proxmox server, to be precise) and I am writing a Perl script.
My problem is this: I have a folder containing files, and every file in this folder has a number as its name (for example: 100, 501, 102...). The lowest possible number is 100 and there is no upper limit.
I want my script to return only the files whose name is between 100 and 500.
So I wrote this:
system(ls /the/path/to/my/files | grep -E "^[1-4][0-9]{2}|5[0]{2}");
I think my regex and the command are correct, because when I type this into a terminal it works.
But as soon as I execute my script, I get these error messages:
String found where operator expected at backupsrvproxmox.pl line 3, near "E "^[1-4][0-9]{2}|5[0]{2}""
(Do you need to predeclare E?)
Unknown regexp modifier "/b" at backupsrvproxmox.pl line 3, at end of line
syntax error at backupsrvproxmox.pl line 3, near "E "^[1-4][0-9]{2}|5[0]{2}""
Execution of backupsrvproxmox.pl aborted due to compilation errors.
I also tried with egrep, but it still does not work.
I don't understand why the error message is about the /b modifier, since I only use integers and no strings.
So, any help would be good !

Instead of calling system tools via system, you can very nicely do it all in your program
my @files = grep {
    my ($n) = m{.*/([0-9]+)}; #/
    defined $n and $n >= 100 and $n <= 500;
}
glob "/the/path/to/my/files/*";
This assumes that the number is at the beginning of the file name, as picked up from the question, so the subpattern for the file name itself directly follows a /. †
  (That "comment" #/ on the right is there merely to turn off wrong and confusing syntax highlighting in the editor.)
The command you tried didn't work because of wrong syntax: system takes either a string or a list of strings, while you gave it a bunch of "barewords", which confused the interpreter into emitting a complex error message (most of the time perl's error messages are right to the point).
But there is no need to suffer through syntax details, which can get rather involved for this, nor through shell invocations, which are complex and messy (under the hood) and inefficient.
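For reference only, here is a minimal sketch of what a syntactically valid system call could look like if you did want to keep the shell pipeline: the whole command is passed as one quoted string. The scratch directory and the file names are assumptions made up for illustration, and the regex is grouped and anchored at both ends (a change from the one in the question, so that names like 501 or 1000 are rejected).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory standing in for /the/path/to/my/files.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(100 250 500 501 1000)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

# The whole pipeline is ONE string, so perl hands it to the shell as-is.
# \$ keeps the end-of-line anchor from being treated as a Perl variable.
my $cmd = qq{ls '$dir' | grep -E '^([1-4][0-9]{2}|500)\$'};

# system returns the exit status; grep exits with 0 when something matched.
my $status = system($cmd);
die "pipeline failed with status $status" if $status != 0;
```

The pipeline itself writes the matching names to standard output; system does not capture them, which is one more reason the in-program approach is easier to build on.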
† It also assumes that the files are not in the current directory -- clearly, since a path is passed to glob (and not just * for files in the current directory), which returns the filename with the path, and which is why we need the .*/ to greedily get to the last / before matching the filename.
But if we are in the current directory that won't work, since there would be no / in the filename. To include this possibility the regex needs to be modified, for example like
my ($n) = m{ (?: .*/ | ^) ([0-9]+)}x;
This matches filenames beginning with a number, either after the last slash in the path (with .*/ subpattern) or at the beginning of the string (with ^ anchor).
The modifier /x makes it discard literal spaces in the pattern so we can use them freely (along with newlines and # for comments!) to make that mess presumably more readable. Then I also use {} for delimiters so to not have to escape the / in the pattern (and with any delimiters other than // we must have that m).
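A small sketch of that modified pattern in use, run against plain strings so it can be tried without touching the file system. The helper name leading_number and the sample names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the leading number of a file name,
# whether or not the name is preceded by a directory part.
sub leading_number {
    my ($path) = @_;
    my ($n) = $path =~ m{ (?: .*/ | ^ ) ([0-9]+) }x;
    return $n;    # undef when the name does not start with a digit
}

my @names = (
    '/the/path/to/my/files/250',    # with a path
    '250',                          # bare name, as in the current directory
    '/the/path/to/my/files/501',    # out of range
    'notes.txt',                    # no leading number
);

# Keep only the names whose leading number lies in 100..500.
my @in_range = grep {
    my $n = leading_number($_);
    defined $n and $n >= 100 and $n <= 500;
} @names;

print "$_\n" for @in_range;
```

Only the first two sample names pass the filter.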

Using a regular expression to try to match a range of numbers is just a pain. And this is perl; no need to shell out to external programs to get a list of files (Generally also a bad idea in shell scripts; see Why you shouldn't parse the output of ls(1))!
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub getfiles {
    my $directory = shift;
    opendir my $dir, $directory or die "Unable to open $directory: $!";
    my @files =
        grep { /^\d+$/ && $_ >= 100 && $_ <= 500 } readdir $dir;
    closedir $dir;
    return @files;
}
my @files = getfiles '/the/path/to/my/files/';
say "@files";
Or using the useful Path::Tiny module:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Path::Tiny;
# Returns a list of Path::Tiny objects, not just names.
sub getfiles {
    my $dir = path($_[0]);
    # Compare the basename: the object itself stringifies to the full path.
    return grep { $_->basename >= 100 && $_->basename <= 500 } $dir->children(qr/^\d+$/);
}
my @files = getfiles '/the/path/to/my/files/';
say "@files";

Related

Searching a file name using regex and glob variable

I have a requirement to find a file by name and display a text if it is present. Below is the code I am using:
use strict;
use warnings;
use feature "say";
my @arr={"file1*.gz","file2*.gz"};
foreach my $file (@arr) {
    my $file1=glob ("$file");
    say "$file1";
    if (-e $file1) {
        say "File Generated using script";
    }
}
When I use the above code, I am able to get the 1st element of the array properly, but for the 2nd element I see the error below:
Use of uninitialized value $file1 in string
And if the size of the array is 1, then it is working properly.
I am not sure what's going wrong in the above code.

There is a problem with this line:
my @arr={"file1*.gz","file2*.gz"};
I think you meant to use parentheses instead of curlies to create your array; what you have is a hash reference.
There is also a problem with this line:
my $file1=glob ("$file");
Instead of returning a file name, glob returns undef the second time because you are using it in scalar context and:
In scalar context, glob iterates through such filename expansions,
returning undef when the list is exhausted.
You can use glob in list context, which can be enforced with parentheses around the variable being assigned to, as in my ($file1):
use strict;
use warnings;
use feature "say";
my @arr = ("file1*.gz", "file2*.gz");
foreach my $file (@arr) {
    my ($file1) = glob($file);
    say $file1;
    if ( -e $file1 ) {
        say "File Generated using script";
    }
}
This code grabs only the 1st file name that matches your wildcard. If you want all files that match, you need to add another for loop.
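To see the difference between the two contexts side by side, here is a minimal sketch. It assumes a scratch directory (created with File::Temp) holding exactly one made-up file per pattern, and a temporary path without spaces, since glob splits its argument on whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory with one match per pattern (file names are made up).
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(file1a.gz file2b.gz)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

my (@scalar_ctx, @list_ctx);
for my $pattern ("$dir/file1*.gz", "$dir/file2*.gz") {
    # Scalar context: this call site keeps iterating its first expansion
    # and only picks up a new pattern after it has returned undef.
    my $first = glob($pattern);
    push @scalar_ctx, $first;

    # List context: a fresh expansion on every call.
    my ($only) = glob($pattern);
    push @list_ctx, $only;
}

printf "scalar context: %s\n", join ', ', map { defined $_ ? 'found' : 'undef' } @scalar_ctx;
printf "list context:   %s\n", join ', ', map { defined $_ ? 'found' : 'undef' } @list_ctx;
```

The second scalar-context call is the one that reproduces the "uninitialized value" warning from the question.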

You are not using all the power of the glob function. As toolic says, you are also using curly braces {} wrong -- they create a hash reference, not a list.
Your options in normal array assignment are typically:
my @arr = ('foo', 'bar'); # parentheses
my @arr = qw(foo bar); # "quote word", basically a string split on space
But this is not relevant to the answer to your question: How to find the names of files that exist that match the glob expression.
First off, the argument to glob can contain several patterns, separated by space. For example, from the documentation perldoc -f glob:
...glob("*.c *.h") matches all files with a .c or .h extension.
You should read the entire documentation, it is very enlightening, and core Perl knowledge.
So you do not need to loop around your number of glob patterns, just concatenate them. E.g.
my $globs = "file1*.gz file2*.gz"; # two patterns at once
But there is more. You can use curly braces in globs, creating an alternation. For example, {1,2} will create two alternations with 1 and 2 respectively, so we can simplify your expression further
my $globs = "file{1,2}*.gz"; # still two patterns at once
And there is more. You do not need to store these glob patterns in an array and loop over it, you can just loop over the glob result itself. E.g.
for my $file (glob $globs) { # loop over globs
You also do not need to check if a file exists with -e, as the glob already takes care of that check. If the glob does not return a file name, it was not found. It works much the same as using globs on the command line in bash.
So in short, you can use something like this:
use strict;
use warnings;
use feature "say";
foreach my $file (glob "file{1,2}*.gz") {
    say "File '$file' found";
}
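If you want to try the brace alternation without touching your real files, a sketch along these lines should do. The scratch directory and the three file names are assumptions for illustration, and the temporary path is assumed to contain no spaces.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Made-up files: two that should match and one that should not.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(file1a.gz file2b.gz file3c.gz)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}

# {1,2} expands to two patterns; the * wildcard means that only
# names that actually exist on disk are returned.
my @found = glob "$dir/file{1,2}*.gz";

print "File '$_' found\n" for @found;
```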

Perhaps the OP intended something of the following kind
use strict;
use warnings;
use feature 'say';
my @patterns = qw/file1*.gz file2*.gz/;
for my $pat (@patterns) {
    say 'Pattern: ' . $pat;
    for my $fname ( glob($pat) ) {
        say "\t$fname :: File Generated using script" if -e $fname;
    }
}

perl search and replace a substring

I am trying to search for a substring and replace the whole string if the substring is found. In the example below, someVal could be any value that is unknown to me.
How can I search for someServer.com and replace the whole string $oldUrl with $newUrl?
I can do it on the whole string just fine:
$directory = "/var/tftpboot";
my $oldUrl = "someVal.someServer.com";
my $newUrl = "someNewVal.someNewServer.com";
opendir( DIR, $directory ) or die $!;
while ( my $files = readdir(DIR) ) {
next unless ( $files =~ m/\.cfg$/ );
open my $in, "<", "$directory/$files";
open my $out, ">", "$directory/temp.txt";
while (<$in>) {
s/.*$oldUrl.*/$newUrl/;
print $out $_;
}
rename "$directory/temp.txt", "$directory/$files";
}

Your script will delete much of your content because you are surrounding the match with .*. This will match any character except newline, as many times as it can, from start to end of each line, and replace it.
The functionality that you are after already exists in Perl, the use of the -pi command line switches, so it would be a good idea to make use of it rather than trying to make your own, which works exactly the same way. You do not need a one-liner to use the in-place edit. You can do this:
perl -pi script.pl *.cfg
The script should contain the name definitions and substitutions, and any error checking you need.
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/\Q$old\E/$new/g;
This is the simplest possible solution, when running with the -pi switches, as I showed above. The \Q ... \E is the quotemeta escape, which escapes meta characters in your string (highly recommended).
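To illustrate why the quoting matters, here is a small sketch run on in-memory strings instead of files. The second sample line is made up: it has X and Y where the real name has dots.

```perl
use strict;
use warnings;

my $old = 'someVal.someServer.com';
my $new = 'someNewVal.someNewServer.com';

# The second line only resembles the old name.
my @lines = ('host someVal.someServer.com', 'host someValXsomeServerYcom');

my (@unquoted, @quoted);
for my $line (@lines) {
    (my $u = $line) =~ s/$old/$new/g;        # dots act as wildcards
    (my $q = $line) =~ s/\Q$old\E/$new/g;    # dots match only literal dots
    push @unquoted, $u;
    push @quoted,   $q;
}

print "unquoted: $_\n" for @unquoted;
print "quoted:   $_\n" for @quoted;
```

Without \Q ... \E both lines are rewritten; with it, the look-alike line is left alone.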
You might want to prevent partial matches. If you are matching foo.bar, you may not want to match foo.bar.baz, or snafoo.bar. To prevent partial matching, you can put in anchors of different kinds.
(?<!\S) -- do not allow any non-whitespace before match
\b -- match word boundary
Word boundary would be suitable if you want to replace server1.foo.bar in the above example, but not snafoo.bar. Otherwise use whitespace boundary. The reason we do a double negation with a negative lookaround assertion and negated character class is to allow beginning and end of line matches.
So, to sum up, I would do:
use strict;
use warnings;
my $old = "someVal.someServer.com";
my $new = "someNewVal.someNewServer.com";
s/(?<!\S)\Q$old\E(?!\S)/$new/g;
And run it with
perl -pi script.pl *.cfg
If you want to try it out beforehand (highly recommended!), just remove the -i switch, which will make the script print to standard output (your terminal) instead. You can then run a diff on the files to inspect the difference. E.g.:
$ perl -p script.pl test.cfg > test_replaced.cfg
$ diff test.cfg test_replaced.cfg
You will have to decide whether word boundary is more desirable, in which case you replace the lookaround assertions with \b.
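If it helps with that decision, this sketch compares the two kinds of anchors on the sample names used above. It only tests for a match and replaces nothing.

```perl
use strict;
use warnings;

my $old     = 'foo.bar';
my @samples = ('foo.bar', 'server1.foo.bar', 'foo.bar.baz', 'snafoo.bar');

my %result;
for my $text (@samples) {
    # Whitespace boundary: nothing but whitespace (or line ends) around the match.
    my $ws   = $text =~ /(?<!\S)\Q$old\E(?!\S)/ ? 'match' : 'no match';
    # Word boundary: a dot next to the match still counts as a boundary.
    my $word = $text =~ /\b\Q$old\E\b/ ? 'match' : 'no match';
    $result{$text} = [ $ws, $word ];
    printf "%-16s whitespace: %-9s word: %s\n", $text, $ws, $word;
}
```

The whitespace version accepts only the bare foo.bar, while the word-boundary version also accepts server1.foo.bar and foo.bar.baz; neither accepts snafoo.bar.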
Always use
use strict;
use warnings;
Even in small scripts like this. It will save you time and headaches.

If you want to match and replace any subdomain, then you should devise a specific regular expression to match them.
\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com
The following is a rewrite of your script using more Modern Perl techniques, including Path::Class to handle file and directory operations in a cross platform way and $INPLACE_EDIT to automatically handle the editing of a file.
use strict;
use warnings;
use autodie;
use Path::Class;
my $dir = dir("/var/tftpboot");
while (my $file = $dir->next) {
    next unless $file =~ m/\.cfg$/;
    local @ARGV = "$file";
    local $^I = '.bak';
    while (<>) {
        s/\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com\b/someNewVal.someNewServer.com/;
        print;
    }
    #unlink "$file$^I"; # Optionally delete backup
}
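Before letting that loose on real configuration files, the substitution can be tried on a few in-memory lines. This is only a sketch, and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

my $new = 'someNewVal.someNewServer.com';

# Made-up configuration lines.
my @lines = (
    'server someServer.com',
    'server a.b.someServer.com 69',
    'server other.example.org',
);

# Work on copies so the original lines stay available for comparison.
my @rewritten = map {
    (my $copy = $_) =~ s/\b(?i:(?!-)[a-z0-9-]+\.)*someServer\.com\b/$new/;
    $copy;
} @lines;

print "$_\n" for @rewritten;
```

Any number of leading subdomain labels is swallowed together with someServer.com, and the unrelated host is left untouched.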

Watch for the Dot-Star: it matches everything that surrounds the old URL, so the only thing remaining on the line will be the new URL:
s/.*$oldUrl.*/$newUrl/;
Better:
s/$oldUrl/$newUrl/;
Also, you might need to close the output file before you try to rename it.
If the old URL contains special characters (dots, asterisks, dollar signs...) you might need to use \Q$oldUrl to suppress their special meaning in the regex pattern.

Using string variables containing literal escapes in a Perl substitution

I'm new to Perl and I found behaviour which I don't understand and can't solve.
I'm making a small find and replace program and there are some things I need to do. I have a bunch of files that I need to process. Then I have a list of find / replace rules in an external text file. In replacing there I need three special things:
Replacing utf-8 characters (Czech diacritics)
Work with adding/removing lines (so working in a slurp mode)
Use regular expressions
I want a program that works alone, so I wrote it so that it takes three arguments:
The file to work on
What to find
What to replace.
I'm sending parameters in a loop from a bash script which parses the rules list and loads other files.
My problem is when I have a "\n" string in a rules list and I send it to the Perl script. If it's in the first part of replacement (in the find section) it looks for a newline correctly, but when it's in the second part (the replace section) it just prints \n instead of a newline.
I tried hardcoding "\n" to the string right into the variable instead of passing it from the list and then it works fine.
What's the reason Perl doesn't interpret the "\n" string there, and how can I make it work?
This is my code:
list.txt - One line from the external replacement list
1\. ?\\n?NÁZEV PŘÍPRAVKU;\\n<<K1>> NÁZEV PŘÍPRAVKU;
farkapitoly.sh - The bash script for parsing list.txt and cycling through all of the files and calling the Perl script
...
FILE="/home/tmp.txt"
while read LINE
do
FIND=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $1}'`
REPLACE=`echo "$LINE" | awk -F $';' 'BEGIN {OFS = FS} {print $2}'`
perl -CA ./pathtiny.pl "$FILE" "$FIND" "$REPLACE"
done < list.txt
...
pathtiny.pl - The Perl script for find and replace
#!/usr/bin/perl
use strict;
use warnings;
use Modern::Perl;
use utf8; # Enable typing Unicode in Perl strings
use open qw(:std :utf8); # Enable Unicode to STDIN/OUT/ERR and filehandles
use Path::Tiny;
my $file = path("$ARGV[0]");
my $searchStr = "$ARGV[1]";
my $replaceStr = "$ARGV[2]";
# $replaceStr="\n<<K1>> NÁZEV PRÍPRAVKU"; # if I hardcode it here \n is replaced right away
print("Search String:", "$searchStr", "\n");
print("Replace String:", "$replaceStr", "\n\n");
my $guts = $file->slurp_utf8;
$guts =~ s/$searchStr/$replaceStr/gi;
$file->spew_utf8($guts);
If it's important, I'm using Linux Mint 13 64-bit on VirtualBox (under Win 8.1) and I have Perl v5.14.2. Every file is UTF-8 with Linux endings.
Example files can be found on pastebin. This should end up like this.
But the examples vary a lot. I need a universal way to write a newline in a replacement string so that it is replaced correctly.

The problem is that the replacement string is read literally from the file, so if your file contains
xx\nyy
then you will read exactly those six characters. Also, the replacement part of a substitution is evaluated as if it was in double quotes. So your replacement string is "$replaceStr" which interpolates the variable and goes no further, so you will again have xx\nyy in the new string. (By the way, please avoid using capital letters in local Perl identifiers as in practice they are reserved for globals such as Module::Names.)
The answer lies in using eval, or its equivalent - the /e modifier on the substitution.
If I write
my $str = '<b>';
my $r = 'xx\nyy';
$str =~ s/b/$r/;
then the replacement string is interpolated to xx\nyy, as you have experienced.
A single /e modifier evaluates the replacement as an expression instead of just a double-quoted string, but of course $r as an expression is xx\nyy again.
What you need is a second /e modifier, which does the same evaluation as a single /e and then does an additional eval of the result on top. For this it is cleanest if you use qq{ .. } as you need two levels of quotation.
If you write
$str =~ s/b/qq{"$r"}/ee
then perl will evaluate qq{"$r"} as an expression, giving "xx\nyy", which, when evaluated again will give you the string you need - the same as the expression 'xx' . "\n" . 'yy'.
Here's a full program
use strict;
use warnings;
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/qq{"$r"}/ee;
print $s;
output
<xx
yy>
But don't forget that, if your replacement string contains any double quotes, like this
my $r = 'xx\n"yy"'
then they must be escaped before putting the string through the substitution, as the expression itself also uses double quotes.
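One way to do that escaping, as a minimal sketch: backslash every double quote in a copy of the replacement string before the double evaluation. The variable name $escaped is made up here. Note that this does nothing about $ or @ in the input, which would still be interpolated by the eval, so it should only be fed strings you trust.

```perl
use strict;
use warnings;

my $s = '<b>';
my $r = 'xx\n"yy"';    # literal backslash-n plus double quotes

# Turn each " into \" so that it survives the inner double quotes.
(my $escaped = $r) =~ s/"/\\"/g;

# First evaluation builds the quoted string, second one interprets its escapes.
$s =~ s/b/qq{"$escaped"}/ee;

print "$s\n";
```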
All of this is quite hard to grasp, so you may prefer the String::Escape module which has an unbackslash function that will change a literal \n (and any other escapes) within a string to its equivalent character "\n". It's not a core module so you probably will need to install it.
The advantage is that you no longer need a double evaluation, as the replacement string can be just unbackslash $r, which gives the right result if it is evaluated as an expression. It also handles double quotes in $r without any problem, as the expression doesn't use double quotes itself.
The code using String::Escape goes like this
use strict;
use warnings;
use String::Escape 'unbackslash';
my $s = '<b>';
my $r = 'xx\nyy';
$s =~ s/b/unbackslash $r/e;
print $s;
and the output is identical to that of the previous code.
Update
Here is a refactoring of your original program that uses String::Escape. I have removed Path::Tiny as I believe it is best to use Perl's built-in inplace-edit extension, which is documented under the General Variables section of perlvar.
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use 5.010;
use open qw/ :std :utf8 /;
use String::Escape qw/ unbackslash /;
our @ARGV;
my ($file, $search, $replace) = @ARGV;
print "Search String: $search\n";
print "Replace String: $replace\n\n";
@ARGV = ($file);
$^I = '';
while (<>) {
    s/$search/unbackslash $replace/eg;
    print;
}

You have \n as the content of a string (as two characters, \ followed by n, and not as a single newline).
Perl interprets \n as a newline only when it appears as a literal (e.g. in your code).
The quick fix would be:
my $replaceStr=eval qq("$ARGV[2]"); # evaluating the string makes perl interpret \n as it would in a literal
or, if you don't like eval, you can use the String-Escape cpan module. (the unbackslash function)

You're wanting a literal string to be treated as if it were a double quoted string. To do that you'll have to translate any backslash followed by another character.
The other experts have shown you how to do that over the entire string (which is risky since it uses eval with unvalidated data). Alternatively, you could use a module, String::Escape, which requires an install (not a high bar, but too high for some).
However, the following does a translation of the return value string itself in a safe way, and then it can be used like a normal value in your other search and replace:
use strict;
use warnings;
my $r = 'xx\nyy';
$r =~ s/(\\.)/qq{"$1"}/eeg; # Translate \. as a double quoted string would
print $r;
Outputs:
xx
yy
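If you would rather avoid eval altogether, a lookup table is another option. The following is only a sketch under the assumption that a handful of escapes is enough. The table contents and the name unescape are made up for illustration and are not taken from any module.

```perl
use strict;
use warnings;

# Only these escapes are translated; extend the table as needed.
my %escape = (n => "\n", t => "\t", r => "\r", '\\' => '\\', '"' => '"');

# Hypothetical helper: translates backslash escapes without using eval.
sub unescape {
    my ($text) = @_;
    # Replace backslash + character by its table entry; leave unknown pairs untouched.
    $text =~ s/\\(.)/exists $escape{$1} ? $escape{$1} : "\\$1"/ge;
    return $text;
}

my $r = unescape('xx\nyy');
print $r, "\n";
```

Because nothing is evaluated as code, unvalidated input can at worst produce an untranslated escape.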

Perl regexp how to get the file name out?

I have this directory path:
\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1
How can I get the file name testQEM.txt from the above string?
I use this:
$file =~ /(.+\\)(.+\..+)(\\.+)/;
But get this result:
file = testQEM.txt\main\ABC_QEM
Thanks,
Jirong

I'm not sure I understand, as paths cannot have a file node half way through them! Have multiple paths got concatenated somehow?
Anyway, I suggest you work though the path looking for the first node that validates as a real file using -f
Here is an example
use strict;
use warnings;
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my @path = split /\\/, $path;
my $file = shift @path;
$file .= '\\'.shift @path until -f $file or @path == 0;
print "$file\n";

/[^\\]+\.[^\\]+/
Capture anything separated by a . between two backslashes. Is this what you were looking for?

This is a bit difficult, as directory names can contain periods. This is especially true for *nix systems, but is valid under Windows as well.
Therefore, each possible subpath has to be tested iteratively for file-ness.
I'd maybe try something like this:
my $file;
my $weirdPath = q(/main/ABC_PRD/ABC_QEM/1/testQEM.txt/main/ABC_QEM/1);
my @parts = split m{/}, $weirdPath;
for my $i (0 .. $#parts) {
    my $path = join "/", @parts[0 .. $i];
    if (-f $path) { # optionally "not -d $path"
        $file = $parts[$i];
        last;
    }
}
print "file=$file\n"; # "file=testQEM.txt\n"
I split the weird path at all slashes (change to backslashes if interoperability is not an issue for you). Then I join the first $i+1 elements together and test if the path is a normal file. If so, I store the last part of the path and exit the loop.
If you can guarantee that the file is the only part of the path that contains periods, then using one of the other solutions will be preferable.

my $file = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my ($result) = $file =~ /\\([^\\]+\.[^\\]+)\\/;
Parentheses around $result force the list context on the right hand side expression, which in turn returns what matches in parentheses.

Use regex pattern /(?=[^\\]+\.)([^\\]+)/
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
print $1 if $path =~ /(?=[^\\]+\.)([^\\]+)/;
Test this code here.

>echo "\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1"|perl -pi -e "s/.*([\\][a-zA-Z]*\.txt).*/\1/"
\testQEM.txt

I suggest you study how regexp backtracking works, in particular how * and + behave.
You only need to make a small change to your regexp:
/(.+\\)(.+\..+?)(\\.+)/
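A short sketch comparing the original pattern with the changed one on the path from the question. The variable names $greedy and $lazy are made up for illustration.

```perl
use strict;
use warnings;

my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';

# Original pattern: the greedy .+ after the dot grabs as much as it can
# and gives back only the last path component.
my (undef, $greedy) = $path =~ /(.+\\)(.+\..+)(\\.+)/;

# Changed pattern: the lazy .+? stops at the first backslash that lets the rest match.
my (undef, $lazy) = $path =~ /(.+\\)(.+\..+?)(\\.+)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version reproduces the result from the question, the lazy one yields just testQEM.txt.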

How do I get the last directory from a URL path using a Zeus rewrite rule?

I need a regular expression that will return the last directory in a path.
e.g, from www.domain.com/shop/widgets/, return "widgets".
I have an expression that almost works.
[^/].*/([^/]+)/?$
It will return "widgets" from www.domain.com/shop/widgets/ but not from www.domain.com/widgets/
I also need to ignore any URLs that include a filename. So that www.domain.com/shop/widgets/blue_widget.html will not match.
This must be done using regular expressions as it is for the Zeus server request rewrite module.

/^www\.example\.com\/([^\/]+\/)*([^\/]+)\/$/
What does this do?
Matches normal text for the domain. Adjust this as required.
Matches any number of directories, each of which consists of non-slash characters followed by a slash.
Matches a string of non-slashes.
Matches a slash at the end of the input, thus eliminating files (since only directories end in a slash).
Implemented in Perl:
[ghoti@pc ~] cat perltest
#!/usr/local/bin/perl
@test = (
    'www.example.com/path/to/file.html',
    'www.example.com/match/',
    'www.example.com/pages/match/',
    'www.example.com/pages/widgets/thingy/',
    'www.example.com/foo/bar/baz/',
);
foreach (@test) {
    $_ =~ m/^www\.example\.com\/([^\/]+\/)*([^\/]+)\/$/i;
    printf(">> %-50s\t%s\n", $_, $2);
}
[ghoti@pc ~] ./perltest
>> www.example.com/path/to/file.html
>> www.example.com/match/ match
>> www.example.com/pages/match/ match
>> www.example.com/pages/widgets/thingy/ thingy
>> www.example.com/foo/bar/baz/ baz
[ghoti@pc ~]

This should generally work:
/([^/.]+)/$
It matches a set of non-slash, non-period characters after the second-to-last slash in a string that must end in a slash.
The "folder name" will be in the first capture group.
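A quick way to check the pattern against the URLs from the question is a few lines of Perl. This only exercises the expression in Perl's regex engine; whether Zeus accepts exactly the same syntax is an assumption that is not verified here.

```perl
use strict;
use warnings;

my @urls = (
    'www.domain.com/shop/widgets/',
    'www.domain.com/widgets/',
    'www.domain.com/shop/widgets/blue_widget.html',
);

my %last_dir;
for my $url (@urls) {
    # m{...} is used as the delimiter so the slashes need no escaping.
    my ($dir) = $url =~ m{/([^/.]+)/$};
    $last_dir{$url} = $dir;
    printf "%-46s %s\n", $url, defined $dir ? $dir : '(no match)';
}
```

Both directory URLs yield widgets, and the URL ending in a file name does not match at all.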

#!/usr/bin/perl
use strict;
use warnings;
$_ = 'www.domain.com/shop/widgets/';
print "$1\n" if (/\/([^\/]+)\/$/);
$_ = 'www.domain.com/shop/widgets/blue_widget.html';
print "$1\n" if (/\/([^\/]+)\/$/);

You don't want a Perl regular expression. You want a regular expression that Zeus will understand. Although they might call that PCRE, not even PCRE handles all Perl regular expressions.
Most of the answers here are wrong because they aren't thinking about the different sorts of URLs that you can get as input.
Get just the path portion of the URL
Match against the path portion to find what you need
Distinguish between paths that end in a filename and those that don't
There are some examples that you can use as a start. I don't use Zeus and don't want to, so the next part is up to you:
Zeus Rewrite Rules
Mod Rewrite rule to Zeus Server rule (Codeigniter)
http://www.names.co.uk/support/support_centre_home/528-zeus_rewrite_rules_user_guide.html
http://drupal.org/node/46508
I've read that you can pass the request to a Perl program through Perl Extensions for ZWS, but I'd be surprised if you needed to do that. If you have to resort to that, I'd use the URI module to parse the URI and extract the path. Once you have that, split up the path into its components:
use URI;
my $uri = URI->new( ... ); # I don't know how Zeus passes data
my $path = $uri->path;
# undef to handle the leading /
my( undef, @parts ) = split m{/}, $path;
Once you are this far, you have to decide how you want to recognize something as a directory. If you're mapping directly onto a filesystem structure, that is just a matter of popping elements off @parts until you find the directories, then counting back the number you want to skip.
However, I cringe at doing that, no matter what I put in the Perl program. I'd try really hard to get it done just in the Zeus rules first. Show us what you have so far.
