Find sample name and directory name from a path - regex

My script receives the complete path name of a file from another script, and I am trying to break this full path apart in Perl and pass the pieces to my script.
I am unable to extract them using split. Can anyone please suggest how to approach this?
I need to split a path which could look like this: path = /usr/local/projects/Tool/Work/Section12/Tool.Sample.2.pdf
I want to extract the values Sample and /usr/local/projects/Tool/Work/, so that I can assign them to two variables in my script, for example $Sample_id = Sample and $Dir = /usr/local/projects/Tool/Work/
Can anyone please suggest?
Thanks!

You should use the core File::Spec::Functions module so that your code respects any eccentricities of the platform you're working on. splitdir and catdir from that module separate and recombine path components, which makes your task fairly simple.
I've used splitdir here to put the steps of your $path into the array @path. The last element is the file name, which I've copied out using pop, and the second-from-last is Part12, which you don't seem to be interested in, so I've used another pop to get rid of that.
Then all that's left is to rebuild $dir_path from what's left of @path, and extract the part of the file name that you're interested in.
To do the latter there are several options, depending on what you mean. It could be the second field of the file name split on dots, or the third from the end, split the same way. I've gone for the field that starts with sample in either upper or lower case. A regex finds that for me.
use strict;
use warnings 'all';
use File::Spec::Functions qw/ splitdir catdir /;
my $path = 'E:/usr/local/projects/Tool/Work/Part12/Tool.Sample01.2.pdf';
my @path = splitdir $path;
my $file = pop @path; # Copy and remove the file name from the end
my $local_dir = pop @path; # Remove `Part12` per requirement
my $dir_path = catdir @path; # Rebuild what is left of the path
# Pick the first subsequence of the file name that starts with `sample`
#
my ($sample) = grep /^sample/i, split /\./, $file;
print "\$sample = $sample\n";
print "\$dir_path = $dir_path\n";
output
$sample = Sample01
$dir_path = /usr/local/projects/Tool/Work
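For the other two readings mentioned above (second field from the start, or third from the end), a minimal sketch could index the split list directly. It assumes the file name always has the four-field shape of the sample name used in the code above; that shape is an assumption, not something the question confirms.

```perl
use strict;
use warnings;

# Sample file name taken from the code above
my $file = 'Tool.Sample01.2.pdf';

# Split the file name on literal dots
my @fields = split /\./, $file;

my $second_field   = $fields[1];    # second field counting from the start
my $third_from_end = $fields[-3];   # third field counting from the end

print "$second_field\n";
print "$third_from_end\n";
```

Both lines print Sample01 for this name. The two choices only differ when a name has more or fewer dot-separated fields.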

There are two parts to this -- split the full path, and extract particular components of some of its parts. Splitting a file name with the full path into its components is nicely done by a few modules. Here I'll use the core module File::Basename. Then the path and filename can be processed for specific requirements, and here I'll use regex.
use warnings;
use strict;
use File::Basename qw(fileparse);
my $fullname = '/usr/local/projects/Tool/Work/Section12/Tool.Sample.2.pdf';
# Parse it into the path and filename
my ($filename, $path) = fileparse($fullname);
# Extract needed part of the path: all except last directory
my ($dirs) = $path =~ m|(.*)/.*/|; # / stop editor coloring
# Extract needed part of filename: between the first `.` and the next
my ($tag) = $filename =~ /[^.]+\.([^.]+)/;
print "$dirs\n$tag\n";
This prints
/usr/local/projects/Tool/Work
Sample
The regexes for pulling parts out of the path and filename are both specific to the task. The first one uses the fact that we only need to drop the last component of the path, so the greediness of .* works out right. The second one relies on the wanted part sitting between the very first . and the next one.
Note that in the basic invocation above the extension is not extracted and the filename is returned with its extension. Thanks to Borodin for bringing this up in a comment. See the documentation, as should always be done with any suggested modules.
This is by far the most common need when working with full paths. But if you want to get the extension split off as well then pass another argument, which can be a list of extensions to seek or a regex. Then the file-name part will be returned without the extension.
my ($base, $path, $ext) = fileparse($fullname, @suffix_list);
For example, @suffix_list can be qr/\.[^.]*/ and in this case we have
my ($base, $path, $ext) = fileparse($fullname, qr/\.[^.]*/);
print "$path\n$base\n$ext\n";
printing
/usr/local/projects/Tool/Work/Section12/
Tool.Sample.2
.pdf
A note on reliability, from docs:
You are guaranteed that $directories . $filename . $suffix will denote the same location as the original $path.
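That guarantee can be checked directly. The following is only a small sketch that reuses the path from the question and the suffix pattern shown above, then glues the three returned pieces back together.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $fullname = '/usr/local/projects/Tool/Work/Section12/Tool.Sample.2.pdf';

# Split into base name, directory part, and extension
my ($base, $path, $ext) = fileparse($fullname, qr/\.[^.]*/);

# Concatenating directories, file name, and suffix should give back the input
my $rebuilt = $path . $base . $ext;

print $rebuilt eq $fullname ? "round trip ok\n" : "mismatch: $rebuilt\n";
```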

Related

Perl : Picking a varying filename with a specific extension from a specific folder

Problem: I have a specific folder which contains only one file with a specific extension. This file is generated once and its name always varies...
I want to assign it to a variable and then pass it to a subroutine in Perl.
I tried the following
my $file = "./abc/def/*.xml";
and also tried using the one field over which I have control
my $file = "./abc/def/._${username}_..xml";
but when I print the file name, it says *.xml or ._name_..xml instead of the actual filename ...
Can someone tell me how to solve my problem.
I am new to perl...so any help here will be great.
I have searched few other places but couldn't find anything for this specific point.
You want glob:
my $file = glob './abc/def/*.xml';
Or perhaps:
my @files = glob './abc/def/*.xml';
Not sure why you're having problems with the $username part though. It should expand that var. (although, I don't know why you have _ either side).
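Putting the two ideas together, here is a minimal sketch that expands a pattern containing a variable and insists on exactly one match. The temporary directory, the user name alice, and the file name report_alice_42.xml are all made up so that the example runs on its own; it also assumes the temporary directory path contains no whitespace, since glob splits its pattern on spaces.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical setup: a scratch directory holding a single generated file
my $dir      = tempdir(CLEANUP => 1);
my $username = 'alice';
my $name     = "$dir/report_${username}_42.xml";
open my $fh, '>', $name or die "Cannot create $name: $!";
close $fh;

# Double quotes interpolate $username; glob then expands the wildcards
my @matches = glob "$dir/*_${username}_*.xml";
die "Expected exactly one match, got " . scalar(@matches) unless @matches == 1;

# List assignment picks the single match out of the list
my ($file) = @matches;
print "$file\n";
```

Assigning in list context, as in the last step, avoids the iterator behaviour that glob has in scalar context.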
You can list your files in the following way, and then use them.
Code:
my @dir = `ls ./abc/def/`;
foreach my $id (@dir)
{
chomp($id);
print "$id\n";
}
Hope this will help you.

can some one explain what this perl snippet involving regex is actually doing?

this code is supposed to give the extension and address of the file ,but how it is doing is not very clear.
($FileName, $TPath, $suffix) = fileparse($SourceFiles[$Index], '\.[^\.]*');
The \. matches a literal . and the [^\.]* matches everything until a . shows up. Essentially, it will match the extension.
But as @gaussblurinc said, it will also match a . by itself. Try modifying it as follows:
qr(\.[^\.]+) # regex operator, catch dot (.) and everything behind it but not dot ([^\.])
Try experimenting here. I have already set it up a bit.
From documentation (fileparse):
my($filename, $directories, $suffix) = fileparse($path);
my($filename, $directories, $suffix) = fileparse($path, @suffixes);
my $filename = fileparse($path, @suffixes);
# If @suffixes are given each element is a pattern (either a string or
# a "qr//") matched against the end of the $filename. The matching
# portion is removed and becomes the $suffix.
# On Unix returns ("baz", "/foo/bar/", ".txt")
fileparse("/foo/bar/baz.txt", qr/\.[^.]*/);
so, I want to search through directories and output or perform some actions on images (resize them, for example with convert)
use File::Spec;
use File::Basename;
sub searchAndResizeImages{
my ($searchDirectory) = shift;
# use find or whatever you want, I prefer to use glob, oh
# prepare images suffixes
# for example, I know about png, gif, jpg, whatever else?
my @suffixes = qw(png gif jpe?g);
# and also try this: different output, but it can be more useful in your task
# my @suffixes = map{".$_"}qw(png gif jpe?g);
for (glob(File::Spec->catfile($searchDirectory, "*"))){
if (!-d) { # if not directory
my ($file, $path, $suffix) = fileparse($_,@suffixes);
if ($suffix){
print "file : $file\n";
print "path : $path\n";
print "suffix : $suffix\n";
# do whatever you want here with images
}
}
}
}
searchAndResizeImages('.');
fileparse is a function from File::Basename
my($filename, $directories, $suffix) = fileparse($path);
my($filename, $directories, $suffix) = fileparse($path, @suffixes);
my $filename = fileparse($path, @suffixes);
The fileparse() routine divides a file path into its $directories, $filename and (optionally) the filename $suffix.
If @suffixes are given each element is a pattern (either a string or a qr//) matched against the end of the $filename. The matching portion is removed and becomes the $suffix.
In this case, the suffix is defined as everything after and including the last period .
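To see how the * and + variants of the suffix pattern differ, a short sketch can run both over a few paths. The helper name_and_suffix and the sample paths other than /foo/bar/baz.txt are made up for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: return only the name and suffix parts from fileparse
sub name_and_suffix {
    my ($path, $pattern) = @_;
    my ($name, undef, $suffix) = fileparse($path, $pattern);
    return ($name, $suffix);
}

for my $path ('/foo/bar/baz.txt', '/foo/bar/archive.tar.gz', '/foo/bar/baz.') {
    my ($name_star, $suffix_star) = name_and_suffix($path, qr/\.[^.]*/);
    my ($name_plus, $suffix_plus) = name_and_suffix($path, qr/\.[^.]+/);
    printf "%-24s  * => name '%s' suffix '%s'   + => name '%s' suffix '%s'\n",
        $path, $name_star, $suffix_star, $name_plus, $suffix_plus;
}
```

The first two paths give the same result with either pattern. Only the name ending in a bare dot differs: the * form strips the lone . as a suffix, while the + form leaves the name untouched and returns an empty suffix.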

Perl regex extracting a match using braces

I tested the following code
#! /usr/bin/perl
use strict;
use English;
#this code extracts the current scripts filename
#by removing the path from the filepath
my $Script_Name = $PROGRAM_NAME;
${Script_Name} =~ s/^.*\\//; #windows path
#${Script_Name} =~ s/^.*\///; #Unix based path
print $Script_Name;
and I don't understand why these braces extract the match without using a /r modifier. Can anyone explain why and how this works, or point me to some documentation?
You're getting a little confused!
The braces make no difference. ${Script_Name} is identical to $Script_Name.
Your code first copies the entire path to the script file from $PROGRAM_NAME to $Script_Name.
Then the substitution removes everything up to and including the last backslash, leaving just the file name.
The /r modifier would be used if you wanted to modify one string and put the result of the modification into another, so you could write your code in one step as
$Script_Name = $PROGRAM_NAME =~ s/^.*\\//r;
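The difference between the two forms can be seen side by side in a small sketch. The Windows path below is a made-up stand-in for $PROGRAM_NAME, and /r needs Perl 5.14 or later.

```perl
use v5.14;
use warnings;

# Hypothetical stand-in for $PROGRAM_NAME
my $program_name = 'C:\scripts\tools\report.pl';

# Copy first, then substitute in place on the copy
my $in_place = $program_name;
$in_place =~ s/^.*\\//;

# With /r the substitution returns the new string and leaves the source alone
my $returned = $program_name =~ s/^.*\\//r;

print "$in_place\n";       # report.pl
print "$returned\n";       # report.pl
print "$program_name\n";   # original path, unchanged
```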

How to check in perl if path is a sub directory or a file from a list of base paths (who have globed paths)

I need an optimum way to match a path with huge list of probable basepaths (which may be many directory levels up). The base paths themselves may be valid shell globbed paths
E.g.
I need to match the path /a/b-12/c/d/e/fg/hi/94.txt in the list below:
/a/b-*/e
/x/y*
/x/{a,b,s*}e
/a/{a,b*,c}2/c/d
/a/b*/c/e
...
The list is long with 10Ks of such paths. Once we have identified the probable parent paths I can verify them by chdir/cd to the path etc.
I would like to do a regex match against the list, but shell-globbed paths are difficult for me to match. Are there Perl modules which can help me with this?
Thanks!
You may want to try Text::Glob from CPAN.
You don't have to use glob_to_regex, I was just playing with it, but this matches:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Glob qw(glob_to_regex);
my @arr = ('/a/b-12/c/d/e/fg/hi/94.txt');
my $regex = glob_to_regex('/a/{a,b*,c}2/c/d/e/fg/hi/94.txt');
print "REGEX: $regex\n";
if ($arr[0] =~ m/$regex/) {
print "Matches\n";
}
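The snippet above matches a full path against a full glob, whereas the question asks which base paths are ancestors of the file. A rough sketch of that idea follows. Note that it does not use Text::Glob: glob_to_prefix_regex is a hand-rolled, hypothetical helper that only understands *, ? and {a,b} (no character classes, no escapes), so treat it as an illustration rather than a replacement for the module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a shell-style glob into a regex that matches the
# globbed directory itself or anything underneath it.
sub glob_to_prefix_regex {
    my ($glob) = @_;
    my ($re, $depth) = ('', 0);
    for my $ch (split //, $glob) {
        if    ($ch eq '*')           { $re .= '[^/]*' }           # stay inside one path component
        elsif ($ch eq '?')           { $re .= '[^/]' }
        elsif ($ch eq '{')           { $re .= '(?:'; $depth++ }   # start of alternation
        elsif ($ch eq '}' && $depth) { $re .= ')';   $depth-- }   # end of alternation
        elsif ($ch eq ',' && $depth) { $re .= '|' }               # alternation separator
        else                         { $re .= quotemeta $ch }     # literal character
    }
    # The glob must be followed by a slash or by the end of the path
    return qr{^$re(?:/|\z)};
}

my $path  = '/a/b-12/c/d/e/fg/hi/94.txt';
my @globs = ('/a/b-*/e', '/x/y*', '/x/{a,b,s*}e', '/a/{a,b*,c}2/c/d', '/a/b*/c/e');

# Keep only the base paths that could be a parent of $path
my @candidates = grep { $path =~ glob_to_prefix_regex($_) } @globs;
print "$_\n" for @candidates;
```

For the sample list from the question this leaves /a/{a,b*,c}2/c/d as the only candidate. With tens of thousands of patterns it would make sense to compile each regex once and reuse it across paths.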

How would I sort files to directories based on filenames?

I have a huge number of files to sort all named in some terrible convention.
Here are some examples:
(4)_mr__mcloughlin____.txt
12__sir_john_farr____.txt
(b)mr__chope____.txt
dame_elaine_kellett-bowman____.txt
dr__blackburn______.txt
These names are supposed to be a different person (speaker) each. Someone in another IT department produced these from a ton of XML files using some script but the naming is unfathomably stupid as you can see.
I need to sort literally tens of thousands of these files with multiple files of text for each person; each with something stupid making the filename different, be it more underscores or some random number. They need to be sorted by speaker.
This would be easier with a script to do most of the work then I could just go back and merge folders that should be under the same name or whatever.
There are a number of ways I was thinking about doing this.
parse the names from each file and sort them into folders for each unique name.
get a list of all the unique names from the filenames, then look through this simplified list of unique names for similar ones and ask me whether they are the same, and once it has determined this it will sort them all accordingly.
I plan on using Perl, but I can try a new language if it's worth it. I'm not sure how to go about reading in each filename in a directory one at a time into a string for parsing into an actual name. I'm not completely sure how to parse with regex in perl either, but that might be googleable.
For the sorting, I was just gonna use the shell command:
`cp filename.txt /example/destination/filename.txt`
but just cause that's all I know so it's easiest.
I don't even have a pseudocode idea of what I'm going to do, so if someone knows the best sequence of actions, I'm all ears. I guess I am looking for a lot of help, and I am open to any suggestions. Many many many thanks to anyone who can help.
B.
I hope I understand your question right, it's a bit ambiguous IMHO. This code is untested, but should do what I think you want.
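use File::Copy;
sub sanitize {
    local $_ = shift;
    s/\.txt$//;    # drop the extension (the dot must be escaped)
    s/_+/ /g;      # underscores are word characters, so turn them into spaces before using \b
    s/\b(?:dame|dr|mr|sir)\b|\d+|\(\w+\)//g;    # drop titles, numbers, and bracketed tags
    s/ +/ /g;      # collapse runs of spaces
    s/^ | $//g;    # trim
    return lc $_;
}
sub sort_files_to_dirs {
    my @files = @_;
    for my $filename (@files) {
        my $dirname = sanitize($filename);
        mkdir $dirname if not -e $dirname;
        copy($filename, "$dirname/$filename");
    }
}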
Are all the current files in the same directory? If that is the case then you could use 'opendir' and 'readdir' to read through all the files one by one. Build a hash using the file name as the key (remove all '_' as well as any information inside the brackets) so that you get something like this -
(4)_mr__mcloughlin____.txt -> 'mr mcloughlin'
12__sir_john_farr____.txt -> 'sir john farr'
(b)mr__chope____.txt -> 'mr chope'
dame_elaine_kellett-bowman____.txt -> 'dame elaine kellett-bowman'
dr__blackburn______.txt -> 'dr blackburn'
Set the value of the hash to be the number of instances of the name occurred so far. So after these entries you should have a hash that looks like this -
'mr mcloughlin' => 1
'sir john farr' => 1
'mr chope' => 1
'dame elaine kellett-bowman' => 1
'dr blackburn' => 1
Whenever you come across a new entry in your hash, simply create a new directory using the key name. Now all you have to do is copy the file with the changed name (use the corresponding hash value as a suffix) into the new directory. So, for example, if you were to stumble upon another entry which reads as 'mr mcloughlin', then you could copy it as
./mr mcloughlin/mr mcloughlin_2.txt
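The hash bookkeeping described above can be sketched as follows. This is only an illustration: the list @names stands in for what readdir would return, name_key is a hypothetical helper, the last file name is invented to show a repeat, and the script only prints the planned targets instead of copying anything.

```perl
use strict;
use warnings;

# Stand-in for the names readdir would return; the last one is invented
my @names = (
    '(4)_mr__mcloughlin____.txt',
    '12__sir_john_farr____.txt',
    '(b)mr__chope____.txt',
    'dame_elaine_kellett-bowman____.txt',
    'dr__blackburn______.txt',
    'mr_mcloughlin_.txt',
);

# Hypothetical helper: reduce a file name to the key used for the hash
sub name_key {
    my ($filename) = @_;
    (my $key = $filename) =~ s/\.txt\z//;   # drop the extension
    $key =~ s/\([^)]*\)//g;                 # drop anything inside brackets
    $key =~ s/^\d+//;                       # drop a leading number
    $key =~ s/_+/ /g;                       # underscores become single spaces
    $key =~ s/^ +| +$//g;                   # trim
    return $key;
}

my %seen;
for my $filename (@names) {
    my $key   = name_key($filename);
    my $count = ++$seen{$key};              # how many times this name has occurred so far
    my $target = $count == 1 ? "$key/$key.txt" : "$key/${key}_$count.txt";
    print "$filename -> ./$target\n";
}
```

The second file for the same speaker ends up as ./mr mcloughlin/mr mcloughlin_2.txt, as in the example above.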
I would:
define what's significant in the name:
is dr__blackburn different than dr_blackburn?
is dr__blackburn different than mr__blackburn?
are leading numbers meaningful?
are leading/trailing underscores meaningful?
etc.
come up with rules and an algorithm to convert a name to a directory (Leon's is a very good start)
read in the names and process them one at a time
I would use some combination of opendir and recursion
I would copy them as you process them; again Leon's post is a great example
if this script will need to be maintained and used in the future, I would definitely create tests (e.g. using http://search.cpan.org/dist/Test-More/) for each regexp path; when you find a new wrinkle, add a new test and make sure it fails, then fix the regex, then re-run the test to make sure nothing broke
I've not used Perl in a while so I'm going to write this in Ruby. I will comment it to establish some pseudocode.
require 'fileutils'

DESTINATION = '/some/faraway/place/must/exist/and/ideally/be/empty'
# get a list of all .txt files in current directory
Dir["*.txt"].each do |filename|
  # strategy:
  # - chop off the extension
  # - switch to all lowercase
  # - get rid of everything but spaces, dashes, letters, underscores
  # - then swap any run of spaces, dashes, and underscores for a single space
  # - then strip whitespace off front and back
  name = File.basename(filename, '.txt').downcase.
    gsub(/[^a-z_\s-]+/, '').gsub(/[_\s-]+/, ' ').strip
  target_folder = DESTINATION + '/' + name
  # make sure we don't overwrite a file
  if File.exist?(target_folder) && !File.directory?(target_folder)
    raise "Destination folder is a file"
  # if directory doesn't exist then create it
  elsif !File.exist?(target_folder)
    Dir.mkdir(target_folder)
  end
  # now copy the file (FileUtils.cp comes from the standard fileutils library)
  FileUtils.cp(filename, target_folder)
end
That's the idea, anyway - I've made sure all the API calls are correct, but this isn't tested code. Does this look like what you're trying to accomplish? Might this help you write the code in Perl?
You can split the filenames using something like
@tokens = split /_+/, $filename
The last entry of @tokens should be ".txt" for all of these filenames, but the second-to-last should be similar for the same person whose name has been misspelled in places (or "Dr. Jones" changed to "Brian Jones" for instance). You may want to use some sort of edit distance as a similarity metric to compare $tokens[-2] for various filenames; when two entries have similar enough last names, they should prompt you as a candidate for merging.
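As one possible similarity metric, here is a minimal sketch using plain Levenshtein distance on the second-to-last token. The helper names, the misspelled file name mr__mclaughlin__.txt, and the threshold of 2 are assumptions chosen for illustration; a real run would need a threshold tuned to the data.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance between two strings, computed row by row
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: the token just before ".txt"
sub surname_token {
    my ($filename) = @_;
    my @tokens = split /_+/, $filename;
    return $tokens[-2];
}

# The second name is an invented misspelling of the first
my @files = ('mr__mcloughlin____.txt', 'mr__mclaughlin__.txt', 'dr__blackburn______.txt');
my $threshold = 2;   # assumed cut-off for "similar enough"

for my $i (0 .. $#files) {
    for my $j ($i + 1 .. $#files) {
        my $distance = edit_distance(surname_token($files[$i]), surname_token($files[$j]));
        print "merge candidate: $files[$i] / $files[$j] (distance $distance)\n"
            if $distance > 0 && $distance <= $threshold;
    }
}
```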
As you are asking a very general question, any language could do this as long as we have a better codification of rules. We don't even have the specifics, only a "sample".
So, working blind, it looks like human monitoring will be needed. So the idea is a sieve. Something you can repeatedly run and check and run again and check again and again until you've got everything sorted to a few small manual tasks.
The code below makes a lot of assumptions, because you pretty much left it to us to handle it. One of which is that the sample is a list of all the possible last names; if there are any other last names, add 'em and run it again.
use strict;
use warnings;
use File::Copy;
use File::Find::Rule;
use File::Spec;
use Readonly;
Readonly my $SOURCE_ROOT => '/mess/they/left';
Readonly my $DEST_DIRECTORY => '/where/i/want/all/this';
my @lname_list = qw<mcloughlin farr chope kellett-bowman blackburn>;
my $lname_regex
= join( '|'
, sort { ( $b =~ /\P{Alpha}/ ) <=> ( $a =~ /\P{Alpha}/ )
|| ( length $b ) <=> ( length $a )
|| $a cmp $b
} @lname_list
)
;
my %dest_dir_for;
sub get_dest_directory {
my $case = shift;
my $dest_dir = $dest_dir_for{$case};
return $dest_dir if $dest_dir;
$dest_dir = $dest_dir_for{$case}
= File::Spec->catfile( $DEST_DIRECTORY, $case )
;
unless ( -e $dest_dir ) {
mkdir $dest_dir;
}
return $dest_dir;
}
foreach my $file_path (
File::Find::Rule->file
->name( '*.txt' )->in( $SOURCE_ROOT )
) {
my $file_name = [ File::Spec->splitpath( $file_path ) ]->[2];
$file_name =~ s/[^\p{Alpha}.-]+/_/g;
$file_name =~ s/^_//;
$file_name =~ s/_[.]/./;
my ( $case ) = $file_name =~ m/(?:^|_)($lname_regex)[._]/i; # non-capturing prefix, so $case gets the last name
next unless $case;
# as we next-ed, we're dealing with only the cases we want here.
move( $file_path
, File::Spec->catfile( get_dest_directory( lc $case )
, $file_name
)
);
}