Getting just the file name from full path - regex

I need to get from a full file path just the name of the file. I've tried to use:
$out_fname =~ s/[\/\w+\/]+//;
but it also "eats up" parts of the file name.
example:
for a file:
/bla/bla/folder/file.part.1.file,
it returned:
.part.1.file

You can do:
use File::Basename;
my $path = "/bla/bla/folder/file.part.1.file";
my $filename = basename($path);
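If the directory and extension are needed as well, the same module's fileparse function can return all three pieces at once. A minimal sketch, assuming a Unix-like path; the suffix pattern qr/\.[^.]*/ is only one possible choice, and it treats the text after the last dot as the extension:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $path = "/bla/bla/folder/file.part.1.file";

# fileparse returns (name, directory, suffix); the suffix pattern is
# matched against the end of the file name.
my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);

print "dir:    $dir\n";      # dir:    /bla/bla/folder/
print "name:   $name\n";     # name:   file.part.1
print "suffix: $suffix\n";   # suffix: .file
```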

Besides File::Basename, there's also Path::Class, which can be handy for more complex operations, particularly when dealing with directories, or cross-platform/filesystem operations. It's probably overkill in this case, but might be worth knowing about.
use Path::Class;
my $file = file( "/bla/bla/folder/file.part.1.file" );
my $filename = $file->basename;

I agree with the other answers, but just wanted to explain the mistake in your pattern. Regex is tricky, but worth it to learn well.
The square brackets define a class of characters that will match. In your case, the class matches a forward slash, a word character (from the \w), the + character, or a forward slash again (this is redundant). Then you are saying to match 1 or more of those. There are multiple strings that could match. The regex engine starts at the earliest possible character, so the first /. Then it grabs as much as possible.
This is clearly not what you intended. For example, if you had a . in one of your directory names, the match would stop there. /blah.foo/bar/x.y.z would return .foo/bar/x.y.z.
The way to think of this is that you want to match all characters up to and including the final /.
All characters then slash: /.*\//
But to be safer, add a caret at front to make sure it starts there: /^.*\//
And to allow forward and backslashes, make a class for that: /^.*[\/\\]/ (i.e. elusive's answer).
A really good reference is Learning Perl. There are about 3 really good regex chapters. They are applicable to non-Perl regex users as well.
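To make the difference concrete, here is a small sketch that runs both substitutions on the same inputs. The helper names strip_with_class and strip_to_last_slash are made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern from the question.
sub strip_with_class {
    my ($path) = @_;
    $path =~ s/[\/\w+\/]+//;    # removes the first run of '/', word chars and '+'
    return $path;
}

# Hypothetical helper: greedy match up to and including the last separator.
sub strip_to_last_slash {
    my ($path) = @_;
    $path =~ s/^.*[\/\\]//;
    return $path;
}

my $path = "/bla/bla/folder/file.part.1.file";
print strip_with_class($path), "\n";                   # .part.1.file
print strip_to_last_slash($path), "\n";                # file.part.1.file
print strip_with_class("/blah.foo/bar/x.y.z"), "\n";   # .foo/bar/x.y.z
```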

Using split on the directory separator is another alternative. This has the same caveats as using a regex (i.e. with filenames it's better to use a module where someone else has already thought about edge cases, portability, different filesystems, etc, and so you don't need matching on both back- and forward-slashes), but useful as another general technique where you have a string with a repeated separator.
my $file = "/bla/bla/folder/file.part.1.file";
my @parts = split /\//, $file;
my $filename = $parts[-1];
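One of those edge cases is a path that ends in a slash. The sketch below compares three approaches on such a path, assuming a Unix-like system; the variable names are illustrative only:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

my $dir_path = "/bla/bla/folder/";

# split discards trailing empty fields, so the last element is 'folder'.
my $via_split = (split m{/}, $dir_path)[-1];

# The substitution removes everything up to the final '/', leaving ''.
(my $via_regex = $dir_path) =~ s{^.*/}{};

# basename ignores the trailing slash.
my $via_module = basename($dir_path);

print "split:    [$via_split]\n";    # split:    [folder]
print "regex:    [$via_regex]\n";    # regex:    []
print "basename: [$via_module]\n";   # basename: [folder]
```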

This is exactly what I would expect it to retain in the given substitution. You are saying replace the longest string of slashes and word characters with nothing. So it grabs all the characters up until the first character you didn't specify and deletes them.
It's doing what you are asking it to do. I join with others in saying use File::Basename for what you are trying to do.
But here is the quickest way to do the same thing:
my $fname = substr( $out_fname, rindex( $out_fname, '/' ) + 1 );
Here, it says find the last occurrence of '/' in the string and give me the text starting one after that position. I'm not anti-regex by any stretch, but it's a simple expression of what you actually want to do. I've had to do stuff like this for so long, I wrote a last_after sub:
sub last_after {
my ( $string, $delim ) = @_;
my $ln = length( $delim );    # declared here so it is still in scope for substr below
unless ( length( $string ) and $ln ) {
return $string // '';
}
my $ri = rindex( $string, $delim );
return $ri == -1 ? $string : substr( $string, $ri + $ln );
}

I also needed to pull just the last field from a bunch of path names. This worked for me:
grep -o '/\([^/]*\)$' inputfile > outputfile

What about this:
$out_fname =~ s/^.*[\/\\]//;
It should remove everything in front of your filename.

Related

Perl split string based on forward slash

I am new to Perl, so this is basic question. I have a string as shown below. I am interested in taking date out of it, so thinking of splitting it using slash
my $path = "/bla/bla/bla/20160306";
my $date = (split(/\//,$path))[3];#ideally 3 is date position in array after split
print $date;
However, I don't see the expected output, but instead I see 5 getting printed.
Since the path starts with the pattern / itself, split returns a list with an empty string first (to the left of the first /); one element more. Thus the posted code miscounts by one and returns the one before last element (subdirectory) in the path, not the date.
If date is always the last thing in the string you can pick the last element
my $date = (split '/', $path)[-1];
where I've used '' for delimiters so as not to have to escape /. (This, however, may confuse since the separator pattern is a regex and // conveys that, while '' may appear to merely quote a string.)
This can also be done with regex
my @parts = $path =~ m{([^/]+)}g;
With this there can be no initial empty string. Or, the last part can be picked out of the full list as above, with ($path =~ m{...}g)[-1], but if you indeed only need the last bit then extract it directly
my ($last_part) = $path =~ m{.*/(.*)};
Here the "greedy" .* matches everything in the string up to the last instance of the next subpattern (/ here), thus getting us to the last part of the path, which is then captured. The regex match operator returns its matches only when it is in the list context so parens on the left are needed.
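The effect of context on the match operator can be seen in a short sketch (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $path = "/bla/bla/bla/20160306";

# Scalar context: the match operator returns true/false, not the capture.
my $matched = $path =~ m{.*/(.*)};

# List context (note the parentheses): it returns the captured groups.
my ($last_part) = $path =~ m{.*/(.*)};

print "$matched\n";     # 1
print "$last_part\n";   # 20160306
```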
Which brings us to the fact that you are parsing a path, and there are libraries dedicated to that.
For splitting a path into its components one tool is splitdir from File::Spec
use File::Spec;
my @parts = File::Spec->splitdir($path);
If the path starts with a / we'll again get an empty string for the first element (by design, see docs). That can then be removed, if present
shift @parts if $parts[0] eq '';
Again, the last element alone can be had like in the other examples.
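Putting the pieces together, a minimal sketch that shows the leading empty field from a plain split and then takes the last component via File::Spec. It uses the path from the question and assumes forward-slash separators:

```perl
use strict;
use warnings;
use File::Spec;

my $path = "/bla/bla/bla/20160306";

# Plain split: the leading '/' produces an empty first field.
my @fields = split m{/}, $path;
print scalar(@fields), "\n";   # 5
print "[$fields[3]]\n";        # [bla]

# File::Spec: same leading empty string, dropped here before use.
my @parts = File::Spec->splitdir($path);
shift @parts if @parts && $parts[0] eq '';
my $date = $parts[-1];
print "$date\n";               # 20160306
```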
Simply bind it to the end:
(\d+)$
# look for digits at the end of the string
See a demo on regex101.com. The capturing group is only for clarification though not really needed in this case.
In Perl this would be (I am a PHP/Python guy, so bear with me when it is ugly)
my $path = "/bla/bla/bla/20160306";
$path =~ /(\d+)$/;
print $1;
See a demo on ideone.com.
Try this
Use a lookahead to do it. Splitting on a lookahead for / keeps the slash at the start of each piece, so a substitution then removes that leading slash.
my $path = "/a/b/c/20160306";
my $date = (split(/(?=\/)/,$path))[3];
$date=~s/^\///;
print $date;
Or else use pattern matching with a capture group.
my $path = "/a/b/c/20160306";
my @data = $path =~m/\/(\w+)/g;
print $data[3];

In Perl, what is the meaning of if (s/^\+//)?

In a Perl/Tk code I found a conditional statement as below
if (s/^\+//)
{
#do something
}
elsif (/^-/)
{
#do another thing
}
Seems like some pattern matching has been done. But I cannot understand it. Can anyone help me understanding this pattern matching?
They are both regular expressions. You can read up on them at perlre and perlretut. You can play around with them on http://www.rubular.com.
They both implicitly do something with $_. There probably is a while or foreach around your lines of code without a loop variable. In that case, $_ becomes that loop variable. It might for instance contain the current line of a file that is being read.
1. If the current value of $_ has a + (plus) sign as its first character, #do something.
2. Else if it starts with a - (minus) sign, #do another thing.
In case 1. it also replaces that + sign with nothing (i.e. removes it). It does not remove the - in 2. however.
Let's look at an explanation with YAPE::Regex::Explain.
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^\+/)->explain();
Here it is. Not really helpful in our case, but a nice tool nonetheless. Note that the (?-imsx and ) parts are the default things Perl implies. They are always there unless you change them.
The regular expression:
(?-imsx:^\+)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \+                       '+'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Update: As Mikko L in the comments pointed out, you should maybe refactor/change this piece of code. While it probably does what it is supposed to, I believe it would be a good idea to make it more readable. Whoever wrote it obviously didn't care about you as the later maintainer. I suggest you do. You could change it to:
# look at the content of $_ (current line?)
if ( s/^\+// )
{
# the line starts with a + sign,
# which we remove!
#do something
}
elsif ( m/^-/ )
{
# the line starts with a - sign
# we do NOT remove the - sign!
#do another thing
}
Those are regular expressions, used for pattern matching and substitution.
You should read up on the concept, but as for your question:
s/^\+//
If the string started with a plus, remove that plus (the "s" means "substitute"), and return true.
/^-/
True if the string starts with a minus.
This code is equivalent to
if ($_ =~ s/^\+//) { # s/// modifies $_ by default
#do something
}
elsif ($_ =~ m/^-/) { # m// searches $_ by default
#do another thing
}
s/// and m// are regexp quote-like operators. You can read about them in perlop.
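A small sketch of that behaviour, using a made-up helper named classify: s/// returns true only when it actually replaced something, and in that case $_ has already been modified, while a plain match leaves $_ untouched.

```perl
use strict;
use warnings;

# Hypothetical helper that classifies one option string via $_.
sub classify {
    local $_ = shift;
    if (s/^\+//) {
        return "plus:$_";     # the leading '+' has been removed from $_
    }
    elsif (/^-/) {
        return "minus:$_";    # the '-' is still there
    }
    return "other:$_";
}

for my $arg ('+foo', '-bar', 'baz') {
    print classify($arg), "\n";
}
# plus:foo
# minus:-bar
# other:baz
```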
The other answers have given a summary of how the code works, but not much of why. Here is a simple example of why one might use such logic.
#!/usr/bin/env perl
use strict;
use warnings;
my $args = {};
for ( @ARGV ) {
if ( s/^no// ) {
$args->{$_} = 0;
} else {
$args->{$_} = 1;
}
}
use Data::Dumper;
print Dumper $args;
When you call the script like
./test.pl hi nobye
you get
$VAR1 = {
'hi' => 1,
'bye' => 0
};
The key is the string, however if it is preceded by no then remove it (to get the key in question) and instead set the value to 0.
The OP's example is a little more involved, but follows the same logic.
if the key starts with a +, remove it and do something
if the key starts with a -, don't remove it and do something else

regular expression url rewrite based on folder

I need to be able to take /calendar/MyCalendar.ics, where MyCalendar.ics could be just about anything with an ICS extension, and rewrite it to /feeds/ics/ics_classic.asp?MyCalendar.ics
Thanks
Regular expressions are meant for searching/matching text. Usually you use a regex to tell a text manipulation tool what to search for, and then use a tool-specific syntax to tell it what to replace the match with.
Regex syntax uses round brackets to define capture groups inside the whole search pattern. Many search and replace tools use capture groups to define which part of the match to replace.
We can take the Java Pattern and Matcher classes as example. To complete your task with the Java Matcher you can use the following code:
Pattern p = Pattern.compile("/calendar/(.*\\.(?i)ics)"); // the backslash must be doubled inside a Java string literal
Matcher m = p.matcher(url);
String rewritenUrl = "";
if(m.matches()){
rewritenUrl = "/feeds/ics/ics_classic.asp?" + url.substring( m.start(1), m.end(1));
}
This will find the requested pattern but will only take the first regex group for creating the new string.
Here is a link to regex replacement information in (imho) a very good regex information site: http://www.regular-expressions.info/refreplace.html
C:\x>perl foo.pl
Before: a=/calendar/MyCalendar.ics
After: a=/feeds/ics/ics_classic.asp?MyCalendar.ics
...or how about this way?
(regex kind of seems like overkill for this problem)
b=/calendar/MyCalendar.ics
index=9
c=MyCalendar.ics (might want to add check for ending with '.ics')
d=/feeds/ics/ics_classic.asp?MyCalendar.ics
Here's the code:
C:\x>type foo.pl
my $a = "/calendar/MyCalendar.ics";
print "Before: a=$a\n";
my $match = (
$a =~ s|^.*/([^/]+)\.ics$|/feeds/ics/ics_classic.asp?$1.ics|i
);
if( ! $match ) {
die "Expected path/filename.ics instead of \"$a\"";
}
print "After: a=$a\n";
print "\n";
print "...or how about this way?\n";
print "(regex kind of seems like overkill for this problem)\n";
my $b = "/calendar/MyCalendar.ics";
my $index = rindex( $b, "/" ); #find last path delim.
my $c = substr( $b, $index+1 );
print "b=$b\n";
print "index=$index\n";
print "c=$c (might want to add check for ending with '.ics')\n";
my $d = "/feeds/ics/ics_classic.asp?" . $c;
print "d=$d\n";
C:\x>
General thoughts:
If you do solve this with a regex, a semi-tricky bit is making sure your capture group (the parens) exclude the path separator.
Some things to consider:
Are your paths separators always forward-slashes?
Regex seems like overkill for this; the simplest thing I can think of is to get the index of your last path separator and do simple string manipulation (2nd part of sample program).
Libraries often have routines for parsing paths. In Java I'd look at the java.io.File object, for example, specifically
getName()
Returns the name of the file or directory denoted by
this abstract pathname. This is just the last name in
the pathname's name sequence

How to use Perl to perform substitutions based on calculations?

I'm trying to write a perl script to search through a text file, find all decimal numbers, and modify them by some scaling factor. So far, I have succeeded in extracting the numbers with regular expressions:
open(INPUT, $inputPath) or die "$inputPath cannot be opened.";
while ($thisLine = <INPUT>) {
while ($thisLine =~ m/(-*\d+\.\d+)/g) {
if(defined($1)) {
$new = $scalingFactor*$1;
print $new."\n";
}
}
}
close (INPUT);
However, I haven't yet figured out how to reinsert the new values into the file. I tried using s/(-*\d+.\d+)/$scalingFactor*$1/g for substitution, but of course this inserted the string representation of $scalingFactor instead of evaluating the expression.
I'm a perl newbie, so any help would be greatly appreciated. Thanks in advance,
-Dan
Edit: Solution (based on Roman's Reply)
while ($thisLine = <INPUT>) {
$thisLine =~ s/(-*\d+\.\d+)/$scalingFactor*$1/ge;
print OUTPUT $thisLine;
}
Alternatively, Sean's solution also worked great for me. Thanks all!
s/(-*\d+\.\d+)/$scalingFactor*$1/ge
(notice e in the end)
Here's a self-contained subroutine that'll do the job. It uses the special variable $^I, which activates Perl's in-place editing feature. (See the "perlvar" man page for more information on $^I, and the "perlrun" man page for information about the -i command-line switch, which turns on in-place editing.)
use strict; # Always.
sub scale_numbers_in_file_by_factor {
my ($path, $scaling_factor) = @_;
local @ARGV = ($path);
local $^I = '.bak';
while (<>) {
s/ ( -? \d+ \. \d+ ) / $scaling_factor * $1 /gex;
print;
}
}
scale_numbers_in_file_by_factor('my-file.txt', .1);
A backup file will be made by appending '.bak' to the original filename. Change the '.bak' to '' above if you don't want a backup.
You might want to tweak your number-recognizing regular expression. As written, it only matches numbers that contain a decimal point followed by at least one digit, so plain integers are skipped. I think you also want -? to match an optional minus sign, not -*, which will match any number of minus signs. Performing arithmetic on a string with more than one leading minus sign will almost certainly not do what you want.
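One possible tweak is sketched below, with a hypothetical helper named scale_numbers. The pattern allows at most one minus sign and makes the fractional part optional, so integers are scaled too. Note that it would also scale digits embedded in identifiers such as x1, which may or may not be what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: scale every number found in a string.
sub scale_numbers {
    my ($text, $factor) = @_;
    # -?          : at most one leading minus sign
    # (?:\.\d+)?  : the fractional part is optional
    $text =~ s/(-?\d+(?:\.\d+)?)/$factor * $1/ge;
    return $text;
}

print scale_numbers("x=1.5 y=-2.25 z=4", 2), "\n";   # x=3 y=-4.5 z=8
```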

How can I validate a filename with only eight digits and an extension, in Perl?

Perl (no modules loaded and -Tw & strict)
I found much info on regex and pattern matching here but not exactly what I need.
I want to know if this is the correct way to validate a couple things.
Sorry about the beginner attempt here. I am quite new to this.
my $this = "12345678";
if ($this != m/\b[0-9]{8}\b/x) { print "$this is bad"; }
my $that = "12345678.gif";
if ($that != m/\b[0-9]{8}\.gif\b/x) { print "$that is bad"; }
or
if ($that != m/\b[0-9]{8}\.(jpe?g|gif|png)\b/x) { print "$that is bad"; }
my ($ext) = $that =~ m/\.([^\.]+)$/x;
# verify extension
if ($ext != m/\.(jpe?g|png|gif)$/x ){ print "$ext is bad"; }
# for content type
if ($ext eq "jpg") {$ext = "jpeg";}
I use the /x because perl::critic indicated I needed it. Passes with /x so...
\d is not an option and should be avoided from what I have read here.
The file name "that" has to be 8 digits + an image type. The other set of digits "this" is actually a folder name. This provides a little error checking for an image serving script. The root htaccess sends calls to images in a specific folder to said script. I grab the directory and image name off via path info.
If you are running under taint checking, this isn't the way to do it. You need to match the pattern you need then remember that in a memory variable to clear the taint:
my $this = ...;
my $regex = qr/
^ # beginning of string
( # start of $1
[0-9]{8}
\.
(gif|jpg) # extension in $2
)
\z #end of string
/x;
my( $cleansed, $extension ) = do {
if( $this =~ m/$regex/ ) { ( $1, $2 ) }
else { die "Bad filename!" }
};
I'm not sure why you have a \b at the beginning of your regex. It probably doesn't do what you think it does. If you want the file name to be only the digits, use the ^ beginning of string anchor instead. That way, nothing can come before the digits. Likewise, the end of string anchor \z says that nothing can come after the extension.
If you then need to match an extension to a content-type for an HTTP response, which I'm guessing is what you're doing, you can use a hash to make the map:
my %types = (
jpg => 'jpeg',    # values quoted: barewords are not allowed under strict
gif => 'gif',
...
);
Now that you have the hash, you can use it as another level of validation:
unless( exists $types{$extension} ) { die "Unsupported type!" }
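The two steps, capturing with an anchored regex and then looking the extension up in a hash, can be combined as in the sketch below. The helper name validate_image_name and the exact set of extensions in %types are assumptions for the example, not part of the original script:

```perl
use strict;
use warnings;

# Assumed mapping from file extension to image content subtype.
my %types = (jpg => 'jpeg', jpeg => 'jpeg', gif => 'gif', png => 'png');

# Hypothetical helper: returns (clean filename, content subtype),
# or an empty list when the name is not acceptable.
sub validate_image_name {
    my ($name) = @_;
    return unless defined $name;
    if ($name =~ /\A([0-9]{8}\.(jpe?g|gif|png))\z/) {
        my ($clean, $ext) = ($1, $2);    # captured copies, as the taint advice suggests
        return ($clean, $types{$ext}) if exists $types{$ext};
    }
    return;
}

for my $candidate ("12345678.jpg", "1234567.gif", "12345678.gif\n") {
    my ($file, $subtype) = validate_image_name($candidate);
    print defined $file ? "ok: $file (image/$subtype)\n" : "Bad filename!\n";
}
# ok: 12345678.jpg (image/jpeg)
# Bad filename!
# Bad filename!
```

The \z anchor is what rejects the third candidate; a plain $ would accept a name with a trailing newline.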
Most of what you have above looks fine. A few points:
if ($ext != m/pattern/) is wrong - the != operator should be !~
\d is just fine, if you're parsing filenames or anything else not likely to be unicode
avoid /x unless you really need it (you're not splitting the regexp up over multiple lines for readability). For that matter, avoid any of the flags unless you need them
(jpe?g|gif|png) can be modified to (?:jpe?g|gif|png) to disable capturing on that set of parentheses (the efficiency increase is nominal in most cases, but it occasionally can make a difference e.g. in a rapid loop, so I make it a habit to not capture unless I need to)
you don't need to escape . inside a character class - i.e. [^\.] can be [^.] (I believe the only character you need to escape is ] itself, but don't take this as gospel) :)
it's "Perl", not "PERL" :)
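As a short illustration of the first point, here is the check written with !~. With !=, Perl would instead compare the string numerically against the result of matching $_, which is not a test of $that at all. The helper name is_bad_name is made up, and the \A and \z anchors are a choice of this sketch rather than something the question used:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the name is NOT eight digits
# followed by an allowed image extension.
sub is_bad_name {
    my ($name) = @_;
    return $name !~ /\A[0-9]{8}\.(?:jpe?g|gif|png)\z/ ? 1 : 0;
}

for my $name ("12345678.gif", "1234.gif", "12345678.exe") {
    print "$name is bad\n" if is_bad_name($name);
}
# 1234.gif is bad
# 12345678.exe is bad
```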
You need to use =~ and !~ instead of == and != for regex matching. Also after removing redundant code and optimizing, I would write it this way.
my $that = "12345678.gif";
if ($that =~ m/\b[0-9]{8}\.(jpe?g|gif|png)\b/x)
{
my $ext = $1;
if ($ext eq "jpg") {$ext = "jpeg";}
}
else
{
print "$that is bad";
}