Perl regexp how to get the file name out? - regex

I have this directory path:
\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1
How can I get the file name testQEM.txt from the above string?
I use this:
$file =~ /(.+\\)(.+\..+)(\\.+)/;
But get this result:
file = testQEM.txt\main\ABC_QEM
Thanks,
Jirong

I'm not sure I understand, as paths cannot have a file node half way through them! Have multiple paths got concatenated somehow?
Anyway, I suggest you work through the path looking for the first node that validates as a real file using -f.
Here is an example
use strict;
use warnings;
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my @path = split /\\/, $path;
my $file = shift @path;
$file .= '\\'.shift @path until -f $file or @path == 0;
print "$file\n";

/[^\\]+\.[^\\]+/
This matches a run of non-backslash characters that contains a period, i.e. a path component with a dot in it. Is this what you were looking for?

This is a bit difficult, as directory names can contain periods. This is especially true for *nix systems, but is valid under Windows as well.
Therefore, each possible subpath has to be tested iteratively for file-ness.
I'd maybe try something like this:
my $file;
my $weirdPath = q(/main/ABC_PRD/ABC_QEM/1/testQEM.txt/main/ABC_QEM/1);
my @parts = split m{/}, $weirdPath;
for my $i (0 .. $#parts) {
my $path = join "/", @parts[0 .. $i];
if (-f $path) { # optionally "not -d $path"
$file = $parts[$i];
last;
}
}
print "file=$file\n"; # "file=testQEM.txt\n"
I split the weird path at all slashes (change to backslashes if interoperability is not an issue for you). Then I join the first $i+1 elements together and test if the path is a normal file. If so, I store the last part of the path and exit the loop.
If you can guarantee that the file is the only part of the path that contains periods, then using one of the other solutions will be preferable.

my $file = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my ($result) = $file =~ /\\([^\\]+\.[^\\]+)\\/;
Parentheses around $result force the list context on the right hand side expression, which in turn returns what matches in parentheses.

Use regex pattern /(?=[^\\]+\.)([^\\]+)/
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
print $1 if $path =~ /(?=[^\\]+\.)([^\\]+)/;

>echo "\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1"|perl -pi -e "s/.*([\\][a-zA-Z]*\.txt).*/\1/"
\testQEM.txt

I suggest you read up on how regex backtracking works, in particular how greedy quantifiers such as * and + behave.
You only need a small change to your regex, making the last quantifier in the second group non-greedy:
/(.+\\)(.+\..+?)(\\.+)/
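To see the difference backtracking makes, here is a small sketch that runs both patterns against the path from the question. The variable names $greedy and $lazy are only illustrative.

```perl
use strict;
use warnings;

my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';

# Original pattern: the greedy .+ after the period runs to the end of the
# string and gives back only enough for the final (\\.+) to match.
my (undef, $greedy) = $path =~ /(.+\\)(.+\..+)(\\.+)/;

# Modified pattern: .+? stops at the first backslash after the period.
my (undef, $lazy) = $path =~ /(.+\\)(.+\..+?)(\\.+)/;

print "greedy: $greedy\n";   # greedy: testQEM.txt\main\ABC_QEM
print "lazy:   $lazy\n";     # lazy:   testQEM.txt
```

Both patterns still rely on the file name being the only path component that contains a period.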

Related

Extract the id from long file path ( regex) perl

I am trying to extract an id (e.g. 11894373690 here) from a file path that I read into my Perl script:
/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/
and I will then use it to create a new path like
/my/local/projects/Samplename/analysis/test/output/tool2/11894373690_NEW/
I am not able to extract just the id from the path, can anyone please suggest an easy method in perl. I should definitely start learning regular expressions!
Thanks.
I am able to get only the last directory name
$file = "/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/ ";
my ($id) = $file =~ /\.(A1[^]+)/i;
Update - Sorry all I misspelled "not" as "now" earlier! I am not able to extract the id. Thanks!
A simple regex or split are fine, but there are multiple core packages for working with paths.
This uses File::Spec to split the path and to later join the new one. Note that there is no escaping or such, no / counting -- in fact no need to even mention the separator.
use warnings 'all';
use strict;
use File::Spec::Functions qw(splitdir catdir);
my $path_orig = '...';
my @path = splitdir $path_orig;
my ($mark, $dir);
foreach my $i (0..$#path)
{
if ($path[$i] =~ m/(\d+)_cast/)
{
$dir = $1;
$mark = $i;
last;
}
}
my $path_new = catdir @path[0..$mark-1], $dir . '_NEW';
You can manipulate the @path array in other ways, of course -- peel components off of the back of it (pop @path while $path[-1] !~ /.../), or iterate and copy into a new array, etc.
The code above is simple and doesn't need extra data copy nor multiple regex matches.
Apparently the old and new path have another difference (tool1 vs tool2), please adjust. The main point is that once the path is split it is simple to go through the array.
As for a simple regex to fetch the id
my ($id) = $path =~ m{/(\d+)_cast/};
If \d+_cast is certain to be un-ambiguous (only one dir with that in its name) drop the / above.
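Applied to the paths from the question, the regex approach might be sketched as follows. This assumes the only differences between the old and new path are tool1 versus tool2 and the _cast versus _NEW suffix; $prefix is just an illustrative name.

```perl
use strict;
use warnings;

my $path = '/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/';

my $path_new;
# Capture everything before /tool1/ and the digits in front of _cast
if ( my ($prefix, $id) = $path =~ m{^(.*)/tool1/(\d+)_cast/} ) {
    $path_new = "$prefix/tool2/${id}_NEW/";
}

print defined $path_new ? "$path_new\n" : "no id found\n";
# /my/local/projects/Samplename/analysis/test/output/tool2/11894373690_NEW/
```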
What needs to stay fixed, and what will be dynamic? For this path, supposing that the suffix will always be _cast, you can use the expression:
(\d+)_cast
so the ID will be in the first capture group.
I did find a way to get the id - it may not be efficient but works for now
I did
my $dir_path = "/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/ ";
my @keys = split(/[\/_]+/, $dir_path);
print "Key is $keys[9]\n";
it prints out 11894373690
Thanks all for the suggestions!

Script or command to increment number in file name inside file

I have a file in which we have entries in following format. I would like to increment the numbers in file names inside this file. So some_v1.png will become some_v2.png. Is there a way with regex OR command line utility to achieve this.
Following is example file (file.config) with file entries as string.
something/some_v1.png
something/some_v4.png
something/some_v3.png
This looks like a great match for awk's "split" function (the four-argument form used here, which stores the separators in s, is a GNU awk extension):
awk '{n=split($0,a,"[1-9][0-9]*",s);for(i=1;i<n;++i)printf "%s%d",a[i],s[i]+1;print a[n]}'
The perl one-liner you already found also works great, with one exception: files with leading-zero numbers will lose the zeroes. Here is a fix for that using the magical auto-increment:
perl -pe 's/(\d+)/++($a=$1)/eg'
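A small sketch of why this works: when ++ is applied to a variable that holds a string of digits and has only been used as a string, Perl increments it as a string and keeps the leading zeroes. The helper name bump_numbers and the some_v007.png input are made up for illustration, and a lexical $n stands in for the one-liner's $a.

```perl
use strict;
use warnings;

# Increment every run of digits in a line, keeping leading zeroes.
sub bump_numbers {
    my ($line) = @_;
    # Copy the captured digits into a fresh variable so that ++ sees a
    # plain string and uses the "magical" string increment.
    $line =~ s/(\d+)/my $n = $1; ++$n/eg;
    return $line;
}

print bump_numbers('something/some_v1.png'), "\n";     # something/some_v2.png
print bump_numbers('something/some_v007.png'), "\n";   # something/some_v008.png
```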
If you want to rename a bunch of files I'd use an auxiliary directory and a test to see if there is an actual file to rename.
mkdir aux
for i in {1..7} ; do
j=$((i + 1))
[ -f something/some_v${i}.png ] && mv something/some_v${i}.png aux/some_v${j}.png
done
mv aux/* something
rmdir aux
The use of a fixed name for the auxiliary directory could not stand a security review for repeated use in a dynamic production environment but I think it's fine for a one shot use in a controlled environment.
In perl:
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $filename (
    # Highest version first; the non-greedy .*? lets (\d+) capture the whole number
    sort { $b =~ s/.*?(\d+)\.png$/$1/r <=> $a =~ s/.*?(\d+)\.png$/$1/r }
    glob "something/some_v*.png" )
{
    if ( my ($vnum) = $filename =~ m/(\d+)\.png/ ) {
        print "mv $filename ", $filename =~ s|\d+\.png|++$vnum.".png"|re, "\n";
    }
}
Note - sorting numerically, to ensure that you're never replacing 5 with 4, before you've renamed 5.

Efficiently matching a set of filenames with regex in Perl

I'm using Perl to capture the names of files in some specified folders that have certain words in them. The keywords in those filenames are "offers" or "cleared" and "regup" or "regdn". In other words, one of "offers" or "cleared" AND one of "regup" or "regdn" must appear in the filename to be a positive match. The two words could be in any order and there are characters/words that will appear in front of and behind them. A sample matching filename is:
2day_Agg_AS_Offers_REGDN-09-JUN-11.csv
I have a regex that successfully captures each of the matching filenames as a full path, which is what I wanted, but it seems inelegant and inefficient. Attempts at slightly better code have all failed.
Working approach:
# Get the folder names
my @folders = grep /^\d{2}-/, readdir DIR;
foreach my $folder ( @folders ) {
# glob the contents of the folder (to get the file names)
my @contents = <$folder/*>;
# For each filename in the list, if it matches, print it
foreach my $item ( @contents ) {
if ($item =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/i){
print "$item\n";
}
}
}
Attempt at something shorter/cleaner:
foreach my $folder ( @folders ) {
# glob the contents of the folder (to get the file names)
my @contents = <$folder/*>;
# Seems to determine that there are four matches in each folder
# but then prints the first matching filename four times
my $single = join("\n", @contents);
for ($single =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/im) {
print "$&\n";#"Matched: |$`<$&>$'|\n\n";
}
}
I've tried other formatting with the regex, using other options (/img, /ig, etc.), and sending the output of the regex to an array, but nothing has worked properly. I'm not great with Perl, so I'm positive I'm missing some big opportunities to make this whole procedure more efficient. Thanks!
Collect only these file names which contain offers or cleared AND regup or regdn
my @contents = grep { /offers|cleared/i && /regup|regdn/i } <$folder/*>;
Why would it be shorter or cleaner to use join instead of a loop? I'd say it makes it more complicated. What you seem to be doing is just matching loosely based on the conditions
name contains offers or cleared
name contains regup or regdn
name ends with .csv.
So why not just do this:
if ( $file =~ /offers|cleared/i and
$file =~ /regup|regdn/i and
$file =~ /csv$/i)
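As a rough sketch, the three conditions can be tried on an in-memory list without touching the file system. The first name is the sample from the question; the other three are hypothetical names added to show non-matches.

```perl
use strict;
use warnings;

my @names = (
    '2day_Agg_AS_Offers_REGDN-09-JUN-11.csv',    # sample from the question
    '2day_Agg_AS_Cleared_REGUP-09-JUN-11.csv',   # hypothetical, should match
    '2day_Agg_AS_Offers_RRS-09-JUN-11.csv',      # hypothetical, no regup/regdn
    '2day_Agg_AS_Cleared_REGUP-09-JUN-11.txt',   # hypothetical, not a .csv
);

# Keep a name only if all three independent tests succeed
my @hits = grep {
    /offers|cleared/i and /regup|regdn/i and /\.csv$/i
} @names;

print "$_\n" for @hits;
```

Three short patterns joined with "and" are easier to read and to change than a single pattern with lookaheads, and the order of the keywords in the name does not matter.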
You might be interested in something like this:
use strict;
use warnings;
use File::Find;
my $dir = "/some/dir";
my @files;
find(sub { /offers|cleared/i &&
/regup|regdn/i &&
/csv$/i && push @files, $File::Find::name }, $dir);
Which would completely exclude the use of readdir and other loops. File::Find is recursive.

Bulk renaming files with bash and Perl based on file name

I'm looking to bulk rename files in the current directory only and remove certain strings from the end of file names.
Sample:
foo-bar-(ab-4529111094).txt
foo-bar-foo-bar-(ab-189534).txt
foo-bar-foo-bar-bar-(ab-24937932201).txt
the output should look like this:
foo-bar.txt
foo-bar-foo-bar.txt
foo-bar-foo-bar-bar.txt
I want to remove the string -(ab-2492201) at the end of each file name
knowing that the digits can vary in length.
For Perl, a plain regex is preferred over modules or external utilities; for bash, a one-liner is highly preferred.
How can I accomplish that in both Perl and the Bash shell on Linux? I am interested in both solutions.
Try:
$ rename 's/-\(ab-\d+\)(?=\.txt$)//' *.txt
There's a rename command written in Perl. Its first argument is Perl code describing how to transform a filename. You could use the same s/// command in your own Perl program or one-liner.
If that doesn't work, try prename instead of rename; there's a different, non-Perl-based, rename command installed on some systems, in which case the Perl one may be called prename.
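For the "own Perl program" route, a dry-run sketch could look like this. It applies the same substitution to the sample names from the question and only prints the mapping; the %new_name hash is an illustrative helper, and the actual rename call is left commented out.

```perl
use strict;
use warnings;

my @names = (
    'foo-bar-(ab-4529111094).txt',
    'foo-bar-foo-bar-(ab-189534).txt',
    'foo-bar-foo-bar-bar-(ab-24937932201).txt',
);

my %new_name;
for my $old (@names) {
    # Copy first, then edit the copy, so $old keeps the original name
    (my $new = $old) =~ s/-\(ab-\d+\)(?=\.txt$)//;
    $new_name{$old} = $new;
    print "$old -> $new\n";
    # rename $old, $new or warn "Could not rename $old: $!\n";
}
```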
Using Perl Regex to Rename Files
With find, perl, and xargs, you could use this one-liner
find . -type f | perl -pe 'print $_; s/input/output/' | xargs -n2 mv
Results without calling mv should just be
OldName NewName
OldName NewName
OldName NewName
How does it work?
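find . -type f outputs file paths (or file names... you control what gets processed by the regex here!)
-p makes perl loop over the input lines and print $_ after each one; -e executes the inline script
print $_ prints the original file name first, before the substitution changes $_, so each file yields two lines: old name, then new name
xargs -n2 passes two arguments at a time to each mv call
mv therefore receives the old name and the new name as its arguments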
In bash, you could write something like:
for file in *-\(ab-[0-9]*\)*; do
newfile="${file/-(ab-[0-9]*)/}"
mv "$file" "$newfile"
done
When you say under the current directory, do you mean in the current directory, or anywhere in or beneath the current directory and its descendants?
File::Find is a simple way to do the latter, and is a core module so won't need installing. Like so:
use strict;
use warnings;
use autodie;
use File::Find;
find(\&rename, '.');
sub rename {
return unless -f;
my $newname = $_;
return unless $newname =~ s/-\(ab-[0-9]+\)(\.txt)$/$1/i;
print "rename $_, $newname\n";
}
Update
This program will rename all the files with the given filename pattern only within the current directory.
Note that the initial open loop is there only to create sample files for renaming.
use strict;
use warnings;
use autodie;
open my $fh, '>', $_ for qw(
foo-bar-(ab-4529111094).txt
foo-bar-foo-bar-(ab-189534).txt
foo-bar-foo-bar-bar-(ab-24937932201).txt
);
for (glob '*.txt') {
next unless -f;
my $newname = $_;
next unless $newname =~ s/-\(ab-[0-9]+\)(\.txt)$/$1/i;
print "rename $_, $newname\n";
rename $_, $newname;
}
output
rename foo-bar-(ab-4529111094).txt, foo-bar.txt
rename foo-bar-foo-bar-(ab-189534).txt, foo-bar-foo-bar.txt
rename foo-bar-foo-bar-bar-(ab-24937932201).txt, foo-bar-foo-bar-bar.txt
A simpler, shorter (better ? :) ) rename regex :
rename 's#-\(.*?\)##' foo*.txt
check this:
ls -1 | nawk '/foo-bar-/{old=$0;gsub(/-\(.*\)/,"",$0);system("mv \""old"\" "$0)}'
> ls -1 foo*
foo-bar-(ab-4529111094).txt
foo-bar-foo-bar-(ab-189534).txt
foo-bar-foo-bar-bar-(ab-24937932201).txt
> ls -1 | nawk '/foo-bar-/{old=$0;gsub(/-\(.*\)/,"",$0);system("mv \""old"\" "$0)}'
> ls -1 foo*
foo-bar-foo-bar-bar.txt
foo-bar-foo-bar.txt
foo-bar.txt
>
Another way using just perl:
perl -E'for (<*.*>){ ($new = $_) =~ s/(^.+?)(-\(.+)(\..*$)/$1$3/; say $_." -> ".$new}'
(say ... is nice for testing, just replace it with rename $_,$new or rename($_,$new) )
<*.*> read every file in the current directory
($new = $_) =~ saves the result of the following substitution in $new and leaves $_ intact
(^.+?) save this match in $1 and non-greedy match from the beginning until...
(-\(.+) the sequence "-( ...anything..." is found. (this match would be saved in $2)
(\..*$) save everything from the last "." (period) before the end ($) of the line until and including the end of the line -> into $3
substitute the match with the string generated from $1$3
( you could also do it for a specific directory with perl -E'for (</tmp/my/directory/*.*>){ .....

How can I extract a filename from a path using Perl?

I have a Perl variable I populate from the database. Its name is $path. I need to get another variable $file which has just the filename from the pathname.
I tried:
$file = $path =~ s/.*\///;
I am very new to Perl.
Why reinvent the wheel? Use the File::Basename module:
use File::Basename;
...
$file = basename($path);
Why did $file=$path=~s/.*\///; not work?
=~ has higher precedence than =
So
$file = $path =~s/.*\///;
is treated as:
$file = ($path =~s/.*\///);
which does the replacement in $path and assigns either 1 (if replacement occurs) or '' (if no replacement occurs).
What you want is:
($file = $path) =~s/.*\///;
which assigns the value of $path to $file and then does the replacement in $file, leaving $path unchanged.
But again there are many problems with this solution:
It is incorrect. A filename on Unix-based systems (not sure about Windows) can contain a newline, but . by default does not match a newline. So you'll have to use the s modifier so that . matches newlines as well:
($file = $path) =~s/.*\///s;
Most importantly, it is not portable, as it assumes / is the path separator, which is not the case on some platforms such as Windows (which uses \) or classic Mac OS (which uses :). So use the module and let it handle all these issues for you.
use File::Basename
Check the below link for a detailed description on how it works:
http://p3rl.org/File::Basename
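The precedence point can be checked with a short sketch. The path /tmp/some/dir/report.txt is a made-up example, and the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $path = '/tmp/some/dir/report.txt';   # hypothetical path

# Without parentheses the substitution runs on the right-hand variable and
# the assignment receives the number of substitutions made.
my $copy   = $path;
my $result = $copy =~ s/.*\///s;         # $result is 1, $copy is now 'report.txt'

# With parentheses the copy happens first and the substitution edits the copy.
(my $file = $path) =~ s/.*\///s;         # $file is 'report.txt', $path is untouched

# The module gives the same answer without hard-coding the separator.
my $base = basename($path);

print "result=$result file=$file base=$base path=$path\n";
```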
I think the best way of doing this is -
use File::Basename;
my $file_name = basename($0);
So the variable $file_name will have the name of your script
Path::Class may seem like overkill at first—making objects of file and dir paths—but it can really pay off in complicated scripts and offers lots of bonuses that will prevent spaghetti when you get backed into a corner by scope creep. File::Spec is used in the first example for fun to resolve path.
use warnings;
use strict;
use Path::Class qw( file );
use File::Spec;
# Get the name of the current script with the procedural interface-
my $self_file = file( File::Spec->rel2abs(__FILE__) );
print
" Full path: $self_file", $/,
"Parent dir: ", $self_file->parent, $/,
" Just name: ", $self_file->basename, $/;
# OO
my $other = Path::Class::File->new("/tmp/some.weird/path-.unk#");
print "Other file: ", $other->basename, $/;
$url=~/\/([^\/]+)$/;
print "Filename $1\n";
As easy as that:
$path =~ /.*[\/\\](.*)/; # will return 1 (or 0) and set $1
my $file = $1; # $1 contains the filename
To set $file only when the pattern actually matched, use:
$file = $1 if $path =~ /.*[\/\\](.*)/;
The pattern:
.*[\/\\](.*)
| |     |
| |     \- at last there is a group with the filename
| \------- it's the last / in linux or the last \ in windows
\--------- .* is very greedy, so it takes all it could
Use e.g. https://regex101.com/ to check regular expressions.
Extracting the file name from a path is very easy for both Unix and Windows file systems without needing any packages:
my $path;
$path = 'C:\A\BB\C\windows_fs.txt'; # Windows
#$path = '/a/bb/ccc/ddd/unix_fs.txt'; # Unix
my $file = (split( /\/|\\/, $path))[-1];
print "File: $file\n";
# variable $file is "windows_fs.txt" for Windows
# variable $file is "unix_fs.txt" for Unix
The logic is very simple: create an array of all the elements making up the path and retrieve the last one. Perl allows negative indexes, which count from the end of the array; index -1 corresponds to the last element.