Perl Regular expression to replace the last matching string - regex

I have a string as below:
$str = "/dir1/dir2/dir3/file.txt"
I want to remove the /file.txt from this string.
So that the $str will become.
$str = "/dir1/dir2/dir3"
I am using the following regex. But it is replacing everything.
$str =~ s/\/.*\.txt//;
How can I make the regex look for the last '/' instead of the first?
What is the correct regular expression for this?
Please note that file.txt is not fixed name. It can be anything like file1.txt, file2.txt, etc.

If you want to get the path from that string, you can use File::Basename. It is a core module since Perl version 5.
perl -MFile::Basename -le '$str = "/dir2/dir3/file.txt"; print dirname($str);'
In script form:
use strict;
use warnings; # always use these
use File::Basename;
my $str = "/dir1/dir2/dir3/file.txt";
print dirname($str);
Your regex does not work because it is not anchored and .* is greedy, so it matches as much as it can, starting from the first slash / it encounters. A working regex would look something like this:
$str =~ s#/[^/]*?\.txt$##;
Note the use of a non-greedy quantifier *?, which will match the smallest possible string. Also note that I use another delimiter for the substitution to avoid the "leaning toothpick syndrome", e.g. s/\/\/\///.
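To make the difference concrete, here is a minimal sketch comparing the greedy attempt from the question with a pattern that is anchored and cannot cross a slash. The helper name strip_last_component is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the final "/name" component from a path string.
sub strip_last_component {
    my ($path) = @_;
    # [^/]* cannot cross a slash and $ pins the match to the end,
    # so only the last component is removed.
    (my $copy = $path) =~ s{/[^/]*$}{};
    return $copy;
}

my $str = "/dir1/dir2/dir3/file.txt";

# The attempt from the question: the match starts at the first slash
# and the greedy .* runs all the way to the final ".txt".
(my $greedy = $str) =~ s/\/.*\.txt//;

print "greedy:   '$greedy'\n";
print "anchored: '", strip_last_component($str), "'\n";
```

The greedy version leaves an empty string, while the anchored version leaves /dir1/dir2/dir3. For real paths, File::Basename is still the safer choice.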

Very simple regex : s/\/[^\/]*$//

In this regex
m/(.*)\/[^\/]*$/
the first submatch is the path you are looking for.
EDIT:
If you are looking for substitution, user1215106's solution is the way to go:
s/\/[^\/]*$//

Related

How can I use regex to remove /1 or /2?

Regex gurus,
Here is the following line of code I want to parse with regex:
#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
I want to obtain the following:
#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
I have written the following regex on rubular.com:
(#.* *.)(!?(\/.))
My idea is to use negation to remove /1 by (!?(\/.)). However, this produces the entire line?
#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1
Why is (?!thisismystring) not removing /1? I googled the fire out of this, but they seemed to suggest similar things I am already trying? I deeply appreciate your help.
I think what you are trying to write is /(\#.* .*)(?=\/\d)/. The leading character is escaped so that Perl does not try to interpolate it; an unescaped at sign in a regex would be treated as the start of an array name. You need a positive look-ahead, (?=...), rather than a negative one, because you want to match everything up to the point where the next characters are a slash followed by a digit.
Here is a program that demonstrates.
use strict;
use warnings;
use 5.010;
my $s = '#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
$s =~ /(\#.* .*)(?=\/.)/;
print $1, "\n";
But you would be much better off copying the whole string and removing the slash and everything after it, like this
use strict;
use warnings;
my $s = '#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';
(my $fixed = $s) =~ s{/\d+$}{};
print $fixed, "\n";
output
#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0
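As a rough side-by-side of the two approaches above, the sketch below runs both on the same input. The helper name strip_read_suffix is an assumption made for this example only.

```perl
use strict;
use warnings;

my $s = '#ERR030882.2595 HWI-BRUNOP16X_0001:3:1:6649:5175#0/1';

# Hypothetical helper: return a copy of the header without a trailing "/<digits>".
sub strip_read_suffix {
    my ($header) = @_;
    (my $fixed = $header) =~ s{/\d+$}{};
    return $fixed;
}

# Look-ahead version: capture everything up to, but not including, "/<digit>".
# In list context the match returns the captures, or an empty list on failure.
my ($captured) = $s =~ m{\A(.*)(?=/\d)};

print strip_read_suffix($s), "\n";
print defined $captured ? $captured : '(no match)', "\n";
```

Both lines print the header without the trailing /1. The substitution version also returns the input unchanged when there is no suffix, whereas the look-ahead version leaves $captured undefined in that case.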

Opposite of (foo|bar|baz)

I'd like a regex to match everything but a few specific options within a broader expression.
The following example will match test_foo.pl or test_bar.pl or test_baz.pl:
/test_(foo|bar|baz)\.pl/
But I'd like just the opposite:
match test_.*\.pl except for where .* = (foo|bar|baz)
I'm kind of limited in my options for this because this is not directly into a perl program, but an argument to cloc, a program that counts lines of code (that happens to be written in perl). So I'm looking for an answer that can be done in one regex, not multiple chained together.
You should be able to accomplish this by using a negative lookahead:
/test_(?!foo|bar|baz).*\.pl/
This will fail if foo, bar, or baz immediately follows test_.
Note that this could still match something like test_notfoo.pl, and would fail on test_fool.pl. If you do not want this behavior, please clarify by adding some examples of what exactly should and should not match.
If you want to accept something like test_fool.pl or test_bart.pl, then you could change it to the following:
/test_(?!(foo|bar|baz)\.pl).*\.pl/
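A small sketch, using a handful of made-up file names, shows how the two lookahead variants differ. The variable names $prefix_only and $whole_name are chosen here for illustration.

```perl
use strict;
use warnings;

# Fails whenever foo, bar, or baz immediately follows "test_".
my $prefix_only = qr/test_(?!foo|bar|baz).*\.pl/;

# Fails only when the whole name after "test_" is foo.pl, bar.pl, or baz.pl.
my $whole_name = qr/test_(?!(?:foo|bar|baz)\.pl).*\.pl/;

for my $name (qw(test_foo.pl test_fool.pl test_notfoo.pl test_me.pl)) {
    printf "%-15s prefix_only=%s whole_name=%s\n", $name,
        ($name =~ $prefix_only ? 'yes' : 'no'),
        ($name =~ $whole_name  ? 'yes' : 'no');
}
```

Only test_fool.pl is treated differently: the first pattern rejects it and the second accepts it.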
#!/usr/bin/env perl
use strict; use warnings;
my $pat = qr/\Atest_.+(?<!foo|bar|baz)[.]pl\z/;
while (my $line = <DATA>) {
chomp $line;
printf "%s %s\n", $line, $line =~ $pat ? 'matches' : "doesn't match";
}
__DATA__
test_bar.pl
test_foo.pl
test_baz.pl
test baz.pl
0test_bar.pl
test_me.pl
test_me_too.txt
Output:
test_bar.pl doesn't match
test_foo.pl doesn't match
test_baz.pl doesn't match
test baz.pl doesn't match
0test_bar.pl doesn't match
test_me.pl matches
test_me_too.txt doesn't match
(?:(?!STR).)*
is to
STR
as
[^CHAR]
is to
CHAR
So you want
if (/^test_(?:(?!foo|bar|baz).)*\.pl\z/s)
More readable:
my %bad = map { $_ => 1 } qw( foo bar baz );
if (/^test_(.*)\.pl\z/s && !$bad{$1})
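The two forms above are not strictly equivalent, which the following sketch tries to illustrate. The helper name is_wanted and the sample file names are assumptions made for the example.

```perl
use strict;
use warnings;

my %bad = map { $_ => 1 } qw(foo bar baz);

# Hypothetical helper: true for test_<name>.pl unless <name> is exactly
# one of the excluded words.
sub is_wanted {
    my ($file) = @_;
    return 0 unless $file =~ /\Atest_(.*)\.pl\z/s;
    return $bad{$1} ? 0 : 1;
}

# The tempered pattern rejects the excluded words anywhere in the name.
my $tempered = qr/\Atest_(?:(?!foo|bar|baz).)*\.pl\z/s;

for my $file (qw(test_foo.pl test_fool.pl test_me.pl)) {
    printf "%-13s hash=%d tempered=%d\n", $file,
        is_wanted($file), ($file =~ $tempered ? 1 : 0);
}
```

The hash lookup excludes only exact names, so test_fool.pl passes it, while the tempered pattern rejects any name that contains foo, bar, or baz after the prefix.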
Hmm, I might have misunderstood your question. Anyway, maybe this is helpful ...
You would negate the match operator. For example:
perl -lwe "print for grep ! m/(lwp|archive).*\.pl/, glob q(*.pl)"
# Note you'd use single-quotes on Linux but double-quotes on Windows.
# Nothing to do with Perl, just different shells (bash vs cmd.exe).
The ! negates the match. The above is shorthand for:
perl -lwe "print for grep ! ($_ =~ m/(lwp|archive).*\.pl/), glob q(*.pl)"
Which can also be written using the negated match operator !~, as follows:
perl -lwe "print for grep $_ !~ m/(lwp|archive).*\.pl/, glob q(*.pl)"
In case you're wondering, the glob is simply used to get an input list of filenames as per your example. I just substituted another match pattern suitable for the files I had handy in a directory.

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $`, and $' variables have well-known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-performance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so that you don't match an entire string that has no digits), you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
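Wrapped in a function, the list-context form might look like the sketch below. The name prefix_before_digit is hypothetical; the point is that the caller gets undef instead of a stale match variable when nothing matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading run of non-digits, but only if a
# digit follows it. Returns undef when the pattern does not match.
sub prefix_before_digit {
    my ($string) = @_;
    my ($prefix) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
    return $prefix;
}

for my $input ('AAAA_BBBB_12_13_14', 'no digits here', '42abc') {
    my $prefix = prefix_before_digit($input);
    printf "%-20s => %s\n", $input, defined $prefix ? "'$prefix'" : 'undef';
}
```

The first input yields 'AAAA_BBBB_'. The other two yield undef, because one has no digit at all and the other starts with a digit.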
$str =~ /(\d)/; print $`;
This prints the part of the string that precedes the match, which Perl stores in $`.
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_

What does s-/-- and s-/\Z-- in perl mean?

I am a beginner in perl and I have a query regarding pattern matching.
I came across a line in perl where it was written
$variable =~ s-/\Z--;
Further along in the code, another variable was modified in a similar way:
$variable1 =~ s-/--;
Can you please tell me what these 2 lines do?
I want to know what s-/\Z-- and s-/-- mean.
$variable =~ s-/\Z--;
- is used as a delimiter here. However, best practice suggests that you either use / or {} as delimiters.
It could be re-written as:
$variable =~ s{/\Z}{}; # remove a / at the end of a string
Consider:
$variable1 =~ s-/--;
Again, it could be re-written as:
$variable1 =~ s{/}{}; # remove the first /
The s/// operator in Perl is a substitution operation, which performs a search-and-replace on a string using a special kind of pattern called a regular expression. You can read more about regular expressions and Perl's pattern matching in the man pages that come with Perl:
man perlretut
man perlre
If you don't have these on your system, try searching Google for the same.
Applying a substitution to a variable is done with the =~ operator. So the following replaces the first instance of 'foo' in the variable $var with 'bar' (add the /g modifier to replace all instances).
$var =~ s/foo/bar/;
All the Perl operators are documented on the 'perlop' man page.
Even though the most common separator character is a slash (hence s///), you can also use any other punctuation character as a separator. So in this case, the author has decided to use the dash (-) as the separator.
Here's the same line of code above using dash as a separator:
$var =~ s-foo-bar-;
In your case, the dash doesn't seem to add any clarity to the code, so it might be best to update it to use the conventional slashes instead.
The s/// search and replace function in perl can be used with different delimiters, which is what is done in this case. They have replaced / with the minus sign -, or dash.
The s-/-- removes the first / from the string.
The s-/\Z-- matches and removes a slash at the end of the line. I think this is better written: s{/$}{}.
$variable1 =~ s-/--; could be written as
$variable1 =~ s{/}{}xms;
or this
$variable1 =~ s/ \/ //xms;
It means delete the first / in the string.
Regarding s-/\Z--, it is usually written like this
$variable =~ s{/ \Z}{}xms;
or this
$variable =~ s/ \/ \Z //xms;
It means delete a / if it is at the end of the string (\Z).
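To check the two readings above, here is a short sketch that applies both substitutions to copies of the same string. The sample value 'a/b/c/' is made up for the example.

```perl
use strict;
use warnings;

my $path = 'a/b/c/';

# Dash-delimited forms exactly as in the question, applied to copies.
(my $no_trailing = $path) =~ s-/\Z--;   # same as s{/\Z}{}
(my $no_first    = $path) =~ s-/--;     # same as s{/}{}

print "$no_trailing\n";   # a/b/c
print "$no_first\n";      # ab/c/
```

The first substitution only touches a slash at the very end of the string, while the second removes whichever slash comes first.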

In Perl, how can I correctly extract URLs that are enclosed in parentheses?

I've got two questions about Regexp::Common qw/URI/ and regexes in Perl.
I use Regexp::Common qw/URI/ to find URIs in strings and delete them, but I get an error when a URI is between parentheses.
For example: (http://www.example.com)
The error is caused by the ')': when it tries to parse the URI, the app crashes. I've thought of two fixes:
Use a simple (or so I thought) regex that inserts a whitespace before each ) character
Find a function in Regexp::Common qw/URI/ that implements a fix.
In my code I've tried to implement the regex, but the app freezes. The code that I've tried is this:
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.example.com)";
while ($str =~ m/\)/){
$str =~ s/\)/ \)/;
}
my ($uri) = $str =~ /$RE{URI}{-keep}/;
print "$uri\n";
print $str;
The output that I want is: (http://www.example.com )
I'm not sure, but I think that the problem is in $str =~ s/\)/ \)/;
BTW, I've got a question about Regexp::Common qw/URI/. I've got two string type:
ablalbalblalblalbal http://www.example.com
asfasdfasdf http://www.example.com aasdfasdfasdf
I want to remove the URI if it is the last component (and save it). And, if not, save it without removing it from the text.
You don't have to first test for a match to be able to use the s/// operator correctly: If the string does not match the search pattern, it will not do anything.
#!/usr/bin/perl
use strict; use warnings;
my $str = "Hello!!, I love (GOOGLE)";
$str =~ s/\)/ )/g;
print "$str\n";
The general problem of detecting URLs correctly in text is error-prone. See for example Jeff's thoughts on this.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
$str =~ s/\)/ )/;
}
Your program goes into an infinite loop at this point. To see why, try printing the value of $str each time round the loop.
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\)/){
$str =~ s/\)/ )/;
print $str, "\n";
}
The first time it prints "Hello!!, I love (GOOGLE )". The while loop condition is then evaluated again. Your string still matches your regular expression (it still contains a closing parenthesis) so the replacement is run again and this time it prints out "Hello!!, I love (GOOGLE  )" with two spaces.
And so it goes on. Each time round the loop another space is added, but each time you still have a closing parenthesis, so another substitution is run.
The simplest solution I can see is to only match the closing parenthesis if it is preceded by a non-whitespace character (using \S).
my $str = "Hello!!, I love (GOOGLE)";
while ($str =~ m/\S\)/){
$str =~ s/\)/ )/;
print $str, "\n";
}
In this case the loop is only executed once.
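The same idea can also be written without a loop. This is only a sketch: it swaps the while loop for a single global substitution with a lookbehind, and the helper name space_before_paren is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a space before every ")" that directly
# follows a non-whitespace character. Running it twice changes nothing more.
sub space_before_paren {
    my ($text) = @_;
    # (?<=\S) checks the preceding character without consuming it.
    (my $copy = $text) =~ s/(?<=\S)\)/ )/g;
    return $copy;
}

my $str = "Hello!!, I love (GOOGLE)";
print space_before_paren($str), "\n";
print space_before_paren(space_before_paren($str)), "\n";
```

Both lines print the string with exactly one space before the closing parenthesis, since an already-spaced parenthesis no longer satisfies the lookbehind.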
Why not just include the parentheses in the search? If the URLs will always be bracketed, then something like this:
#!/usr/bin/perl
use warnings;
use strict;
use Regexp::Common qw/URI/;
my $str = "Hello!!, I love (http://www.google.com)";
my ($uri) = $str =~ / \( ( $RE{URI} ) \) /x;
print "$uri\n";
The regex from Regexp::Common can be used as part of a longer regex; it doesn't have to be used on its own. Also, I've used the 'x' modifier on the regex to allow whitespace so you can see more clearly what is going on: the brackets with the backslashes are treated as characters to match, while those without define what is to be captured (presumably like the {-keep} option, which I've not used before).
You could also make the brackets optional, with something like:
/ (?: \( ( $RE{URI} ) \) | ( $RE{URI} ) ) /x
although that would result in two match variables, one undefined - so something like following would be needed:
my $uri = $1 || $2 || die "Didn't match a URL!";
There's probably a better way to do this, and also if you're not bothered about matching parentheses then you could simply make the brackets optional (via a '?') in the first regex...
To answer your second question about only matching URLs at the end of the line - have a look at Regex 'anchors' which can force a match against the beginning or end of a line: ^ and $ (or \A and \Z if you prefer). e.g. matching a URL at the end of a line only:
/$RE{URI}\Z/
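For the second question, one possible shape is sketched below. To keep the example free of non-core modules it uses a deliberately simplified URL pattern as a stand-in for the Regexp::Common one; both that pattern and the helper name extract_url are assumptions for illustration.

```perl
use strict;
use warnings;

# Assumption: a simplified stand-in for the Regexp::Common URI pattern.
my $url_re = qr{https?://\S+};

# Hypothetical helper: returns (text, url). The URL is removed from the
# text only when it is the last component of the string.
sub extract_url {
    my ($text) = @_;

    # URL at the end: strip it (and surrounding whitespace) and return it.
    if ($text =~ s/\s*($url_re)\s*\z//) {
        return ($text, $1);
    }

    # URL elsewhere: return it but leave the text untouched.
    if ($text =~ /($url_re)/) {
        return ($text, $1);
    }

    return ($text, undef);
}

for my $line ('ablalbalblalblalbal http://www.example.com',
              'asfasdfasdf http://www.example.com aasdfasdfasdf') {
    my ($text, $url) = extract_url($line);
    print "text: '$text'\n";
    print "url:  '", (defined $url ? $url : ''), "'\n";
}
```

The \z anchor is what separates the two cases: the substitution succeeds only when nothing but whitespace follows the URL.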