Perl regex matching alternative file names

Example:
I used to use a regex to get the extension from a file name:
my $name = "file.zip";
my ($fname, $fext) = $name =~ /(.*)\.(.*)/;
# file
# zip
Now, I need to make sure that it also properly catches .tar.gz files, in case name includes it, otherwise falls back to example above. I did the following:
my $name = "file.tar.gz";
my ($fname, $fext) = $name =~ /(.*)\.(tar\.gz$)|(.*)\.(.*)/;
# file
# tar.gz
Problem:
The problem is that it now only works for file.tar.gz. It doesn't fall back to catching regular files like file.zip, and returns empty values in that case.
How do I do this in one regex, so that it works for both file.tar.gz and file.zip? What am I missing?

You may use
/^(.*?)\.(tar\.gz|[^.]*)$/
Details
^ - start of the string
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible
\. - a dot
(tar\.gz|[^.]*) - Group 2: tar.gz or any 0+ chars other than a dot
$ - end of the string (or just before a trailing newline).
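As a rough check of the breakdown above, here is a minimal sketch that runs the pattern against a few names. The helper name split_name and the third test name are assumptions made for illustration, not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split a file name into (base, extension),
# treating "tar.gz" as a single extension.
sub split_name {
    my ($name) = @_;
    # In list context the match returns the two capture groups.
    return $name =~ /^(.*?)\.(tar\.gz|[^.]*)$/;
}

for my $name ('file.zip', 'file.tar.gz', 'my.file.tar.gz') {
    my ($fname, $fext) = split_name($name);
    print "$name => [$fname] [$fext]\n";
}
```

A name without any dot does not match at all, so the helper returns an empty list in that case.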
Alternatively, you might also use your original pattern but wrap it with a branch reset group:
/(?|(.*)\.(tar\.gz)|(.*)\.(.*))$/
A branch reset group assigns the same IDs to the corresponding capturing groups in each alternative, so both branches fill groups 1 and 2. Since (.*)\.(tar\.gz) is tried first, a string ending in .tar.gz is matched by the first alternative; otherwise the second one, (.*)\.(.*), matches the string.
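A small sketch of the branch reset variant, assuming Perl 5.10 or later (where (?|...) is available). The helper name split_ext is made up for this example.

```perl
use strict;
use warnings;

# Branch reset: both alternatives number their groups 1 and 2.
my $re = qr/(?|(.*)\.(tar\.gz)|(.*)\.(.*))$/;

# Hypothetical helper returning (base, extension) or an empty list.
sub split_ext {
    my ($name) = @_;
    return $name =~ $re;
}

for my $name ('file.zip', 'file.tar.gz') {
    if (my ($fname, $fext) = split_ext($name)) {
        print "$name => [$fname] [$fext]\n";
    }
}
```

Because the groups are renumbered per branch, the two-variable list assignment works no matter which alternative matched.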

perl -e '$name= "file.zip";($fname,$fext)=$name =~ /(.*)\.(tar\.gz|zip)$/ ;print "$fname.$fext"'
file.zip
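Your pattern has four capturing groups, but the list assignment ($fname,$fext)= has only two variables, so only groups 1 and 2 are assigned.
For file.zip the second alternative matches, which fills groups 3 and 4 and leaves groups 1 and 2 undefined.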
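To see the group numbering directly, this sketch prints all four groups of the pattern from the question for both file names. The helper name capture_groups is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return all four capture groups of the
# question's pattern as printable strings, marking unset ones.
sub capture_groups {
    my ($name) = @_;
    my @groups = $name =~ /(.*)\.(tar\.gz$)|(.*)\.(.*)/;
    return map { defined $_ ? "'$_'" : 'undef' } @groups;
}

print join(', ', capture_groups($_)), "\n" for 'file.tar.gz', 'file.zip';
# 'file', 'tar.gz', undef, undef
# undef, undef, 'file', 'zip'
```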

Related

How to use regex in notepad ++ to search for emails with specific domains

I am trying to use Notepad ++ to delete emails that end in #domain2.serverdata.net
here is a string example:
smtp:name#domain1.com;SMTP:name#domain2.com;smtp:name#domain2.serverdata.net;smtp:name#domain3.com;smtp:name_e4d1fe3d-e985-40d0-bc65-32c57c9b14d1#domain2.serverdata.net
I was hoping to use:
;smtp:.*#domain2.serverdata.net
but it captures SMTP:name#domain2.com as well
Ctrl+H
Find what: (?:\A|;)smtp:[^#]*#domain2\.serverdata\.net
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
(?:\A|;) # non-capturing group: beginning of file or a semicolon;
         # this also deletes the first email in the file,
         # which has no semicolon before it
smtp: # literally
[^#]* # 0 or more characters that are not #
#domain2\.serverdata\.net # literally
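The same pattern can be tried outside Notepad++. Below is a minimal Perl sketch applied to the sample line from the question; the helper name strip_domain2 is hypothetical. Note that Perl matches case-sensitively by default, whereas Notepad++ depends on the "Match case" checkbox.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every address ending in
# #domain2.serverdata.net, together with its leading semicolon.
sub strip_domain2 {
    my ($line) = @_;
    (my $cleaned = $line) =~ s/(?:\A|;)smtp:[^#]*#domain2\.serverdata\.net//g;
    return $cleaned;
}

my $line = 'smtp:name#domain1.com;SMTP:name#domain2.com;'
         . 'smtp:name#domain2.serverdata.net;smtp:name#domain3.com;'
         . 'smtp:name_e4d1fe3d-e985-40d0-bc65-32c57c9b14d1#domain2.serverdata.net';

print strip_domain2($line), "\n";
# smtp:name#domain1.com;SMTP:name#domain2.com;smtp:name#domain3.com
```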
Try Regex: ;?smtp:[\w.-]+?#domain2\.serverdata\.net
Quantifiers such as * are greedy by default, so a regex usually matches as much as possible. For instance, START.*STOP applied to the following text:
STARTsghlegdSTOPfsgikbSTARTsvdinusSTOPwegtgw
will capture this part:
STARTsghlegdSTOPfsgikbSTARTsvdinusSTOPwegtgw
^------------------------------------^
In your case, the .* captures everything up to the last instance of #domain2.serverdata.net. You don't want to use . (any character), you want to use "any character except '#'" which is written like this: [^#].
So your full regex would be smtp:[^#]*#domain2\.serverdata\.net. I also dropped the initial ; since it would prevent you from capturing the first mail address.
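The difference between .* and [^#]* can be sketched in Perl as follows. The helper all_matches is a made-up name, and the sample data is taken from the question and from the START/STOP text above.

```perl
use strict;
use warnings;

# Hypothetical helper: return every substring of $text matched by $re.
sub all_matches {
    my ($re, $text) = @_;
    return $text =~ /$re/g;
}

# Greedy .* runs from the first START to the last STOP.
my $text = 'STARTsghlegdSTOPfsgikbSTARTsvdinusSTOPwegtgw';
print "$_\n" for all_matches(qr/START.*STOP/, $text);

my $line = 'smtp:name#domain1.com;SMTP:name#domain2.com;'
         . 'smtp:name#domain2.serverdata.net;smtp:name#domain3.com;'
         . 'smtp:name_e4d1fe3d-e985-40d0-bc65-32c57c9b14d1#domain2.serverdata.net';

# .* swallows the whole line; [^#]* cannot cross a '#', so each
# match stays inside a single address.
my @greedy  = all_matches(qr/smtp:.*#domain2\.serverdata\.net/, $line);
my @negated = all_matches(qr/smtp:[^#]*#domain2\.serverdata\.net/, $line);
printf "greedy: %d match, negated class: %d matches\n",
    scalar @greedy, scalar @negated;
```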
Try this one:
smtp:[^#]+#domain2\.serverdata\.net(;)?

Regex - match up to first literal

I have some lines of code I am trying to remove some leading text from which appears like so:
Line 1: myApp.name;
Line 2: myApp.version
Line 3: myApp.defaults, myApp.numbers;
I am trying and trying to find a regex that will remove anything up to (but excluding) myApp.
I have tried various regular expressions, but they all seem to fail when it comes to line 3 (because myApp appears twice).
The closest I have come so far is:
.*?myApp
Pretty simple - but that matches both instances of myApp occurrences in Line 3 - whereas I'd like it to match only the first.
There's a few hundred lines - otherwise I'd have deleted them all manually by now.
Can somebody help me? Thanks.
You need to add the anchor ^, which matches the start of a line:
^.*?(myApp)
Use the above regex and replace the match with $1 or \1, so that the string myApp is kept in the final result.
Pattern explanation:
^ Start of a line.
.*?(myApp) Shortest possible match up to the first myApp. The string myApp is captured and stored in group 1.
All matched characters are replaced with the contents of group 1.
Your regular expression works in Perl if you add the ^ to ensure that you only match the beginnings of lines:
cat /tmp/test.txt | perl -pe 's/^.*?myApp/myApp/g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
If you wanted to get fancy, you could put "myApp" into a positive lookahead, (?=...), which checks that myApp follows without making it part of the match. That way it doesn't have to be put back by the replacement.
cat /tmp/test.txt | perl -pe 's/^.*?(?=myApp)//g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
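To make the effect of the anchor visible, here is a sketch that compares the unanchored global substitution with the anchored lookahead version on the three sample lines. The helper names strip_unanchored and strip_anchored are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the original attempt, applied globally.
sub strip_unanchored {
    my ($line) = @_;
    (my $copy = $line) =~ s/.*?myApp/myApp/g;
    return $copy;
}

# Hypothetical helper: anchored at the start, with a lookahead so
# that myApp itself is not consumed.
sub strip_anchored {
    my ($line) = @_;
    (my $copy = $line) =~ s/^.*?(?=myApp)//;
    return $copy;
}

for my $line ('Line 1: myApp.name;',
              'Line 2: myApp.version',
              'Line 3: myApp.defaults, myApp.numbers;') {
    printf "%-32s | %s\n", strip_unanchored($line), strip_anchored($line);
}
```

On the third line the unanchored version matches a second time and eats ".defaults, ", while the anchored one can only match once, at the start of the string.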

Perl regular expression to modify a filename

I am trying to remove the first part and the last part of the file name.
I have a filename like /external/prop/new/test/File.Name.will.BE.this.extension.date
I want to remove the first part of the directory (/external) and the last part of the filename extension (.date) so my output file name would be /prop/new/test/File.Name.will.BE.this.extension
eg:
OLD FILE name: /external/prop/new/test/FACL.Prop.for.BBG.txt.09242012
NEW FILENAME: /prop/new/test/FACL.Prop.for.BBG.txt
OLD FILE name: /external/prop/old/test/set2/FACL.Prop.FITCH.csv.09242012
NEW FILENAME: /prop/old/test/set2/FACL.Prop.FITCH.csv
I had tried something like
my($FinalName, $Finalextension) = split /\.(?!.*\.)/, substr($Fname,$Flength);
but it is not quite helpful.
/external will always remain the same but the date will always vary and I can't just remove the numbers as the .extension can be numbers.
$Fname =~ s|^/external||; # Remove leading "external" from path
$Fname =~ s|\.\d{8}$||; # Remove date from end of filename
my $qfn = '/external/prop/new/test/FACL.Prop.for.BBG.txt.09242012';
$qfn =~ s{\.[^.]*\z}{}; # Remove last "extension".
$qfn =~ s{^/[^/]+}{}; # Remove first part of path.
Try this regex which captures the text you need in $1:
^\/[^/]+(.*?)\.\d{8}$
This assumes your date is always 8 numbers. You can replace the last \d{8} with your appropriate date regex.
Regex Break-up:
^\/ matches the beginning of the line followed by a forward slash (escaped)
[^/]+ matches all text until it finds the next forward slash (to mark the end of /external)
(.*?) matches AND captures non-greedily all text you need until it finds the last part of the pattern
\.\d{8}$ matches the last period followed by the 8-digit date followed by end of line
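A minimal sketch of the capture approach, applied to the two example paths from the question. The helper name new_name is hypothetical, and the pattern keeps the answer's assumption that the date is always 8 digits.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the first directory and the trailing
# 8-digit date; returns undef when the path does not fit the pattern.
sub new_name {
    my ($path) = @_;
    my ($kept) = $path =~ m{^/[^/]+(.*?)\.\d{8}$};
    return $kept;
}

for my $old ('/external/prop/new/test/FACL.Prop.for.BBG.txt.09242012',
             '/external/prop/old/test/set2/FACL.Prop.FITCH.csv.09242012') {
    print new_name($old), "\n";
}
# /prop/new/test/FACL.Prop.for.BBG.txt
# /prop/old/test/set2/FACL.Prop.FITCH.csv
```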

Remove characters and numbers from a string in perl

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores & capitalize the first letter
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.
So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*
Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
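Run against the filename from the question, the "according to your example" substitutions can be sketched as one small function. The name pretty_name is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper applying the five substitutions in order.
sub pretty_name {
    my ($name) = @_;
    for ($name) {                     # alias $_ to $name
        s/^[^\pL]+//;                 # drop everything before the first letter
        s/--/ - /;                    # "--" becomes " - "
        s/_/ (/;                      # first remaining underscore opens the parens
        s/_/)/;                       # next underscore closes them
        s/(?<!\pL)(\pL)/\U$1/g;       # upper-case the first letter of each word
    }
    return $name;
}

print pretty_name('_00-author--book_revision_'), "\n";
# Author - Book (Revision)
```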
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one is probably too sed-like for its taste! (Of course, both versions work just the same.)
Explanation (as requested by stema):
$filename =~ s/
    ^         # matches the start of the line
    _\d+-     # matches an underscore, one or more digits and a hyphen-minus
    (.*?)--   # matches (non-greedily) anything before two consecutive hyphen-minus
              # and captures the entire match (as the first capture group)
    (.*?)_    # matches (non-greedily) anything before a single underscore and
              # captures the entire match (as the second capture group)
    (.*?)_    # does the same as the one before (but captures the match as the
              # third capture group obviously)
    $         # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u$1, \u$2 and \u$3 in the replacement simply tell Perl to insert capture groups 1 to 3 with their first character upper-cased. If you wanted to upper-case an entire captured group, you would have to use \U instead.
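The difference between \u and \U can be seen in a short sketch. The helper name demo_case and the sample string are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: \u upper-cases only the next character,
# \U upper-cases everything up to \E or the end of the replacement.
sub demo_case {
    my ($s) = @_;
    (my $t = $s) =~ s/(\w+)--(\w+)/\u$1 - \U$2/;
    return $t;
}

print demo_case('author--book'), "\n";   # Author - BOOK
```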
The x flag turns on verbose mode, which lets us use # comments inside the pattern. Perl ignores them, along with any literal whitespace in the regular expression, so if you want to match a space you have to use either \s or an escaped space (\ ). Unfortunately I couldn't figure out how to tell Perl to ignore whitespace in the replacement specification, which is why I've written that on a single line.
(Also note that I've changed my s terminator from , to /. Perl complained when I kept , with verbose mode turned on, most likely because the comments themselves contain commas and Perl looks for the closing delimiter before it interprets the comments.)
If they all follow that format then try:
my ($author, $book, $revision) = map { ucfirst $_ } $newfile =~ /-(.*?)--(.*?)_(.*?)_/;
print "$author - $book ($revision)\n";

Regular expression to get executed script names from command history

I am trying to write a regex that will parse the syntax for calling a script and capture the script name.
All of these are valid syntax for the call
# normal way
run cred=username/password script.bi
# single quoted username password, also separated in a different way
run cred='username password' script.bi
# username/password is optional
run script.bi
# script extension is optional
run script
# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
This is what I have so far
my $r = q{run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\]+)};
to capture the values in $1.
But I get a
Unmatched [ before HERE mark in regex m/run +(?:cred=(?:[^\s\']*|\'.*\') +)?([ << HERE ^\s\]+)/
The \\ is being treated as a \ and hence in the regex it becomes \] so escaping ] and hence the unmatched [
Replace with run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\\\]+) ( note the \\\\ ) and try.
Also, as the comments point out, you should be using qr for a regex rather than plain q.
( I had just looked at the error, not the validity / efficiency of the regex for your problem)
The essence of your problem with specifying the regex is a difference of one byte: q versus qr. You're writing a regex, so call it what it is. Treating the pattern as a string means you have to deal with the rules for string quoting on top of the rules for regex escaping.
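The q versus qr difference can be sketched with just the character class from the question. The variable names below are illustrative only.

```perl
use strict;
use warnings;

# With q{}, the string rules apply first: \\ collapses to a single
# backslash, so the regex engine later sees \] and never finds the
# end of the character class.
my $as_string = q{([^\s\\]+)};
print "q{} yields: $as_string\n";

my $compiles = eval { my $tmp = qr/$as_string/; 1 } ? 1 : 0;
print "compiles as a regex: ", ($compiles ? "yes" : "no"), "\n";

# With qr{}, the pattern goes to the regex compiler as written.
my $re = qr{([^\s\\]+)};
my ($name) = 'script.bi \\' =~ $re;
print "qr{} captures: $name\n";
```

The eval is only there so that the compile error from the q{} version does not abort the script.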
As for the language that your regex matches, add anchors to force the pattern to match the entire line. The regex engine is fiercely determined and will keep working until it finds a match. Without anchors, it's happy to find a substring.
Sometimes this gives you surprising results. Have you ever dealt with a petulant child (or a childish adult) who takes a narrow, exceedingly literal interpretation of what you say? The regex engine is that way, but it's trying to help.
With the last example it matches because
You said with the ? quantifier that the cred=... subpattern can match zero times, so the regex engine skipped it.
You said the script name is the following substring that's a run of one or more non-whitespace, non-backslash characters, so the regex engine saw cred=username/password, none of which are whitespace or backslash characters, and matched. Regexes are greedy: they consider what's right in front of them without regard to whether a given substring “should have” been matched by another subpattern.
The last example fits the bill—although not in the way that you intended. An important lesson with regexes is any quantifier such as ? or * that can match zero times always succeeds!
Without the $ anchor, the pattern from your question leaves the trailing backslash unmatched, which you can see with a slight modification to $runpat.
qr{run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)(.*)}; # ' SO hiliter hack
Notice the (.*) at the end to grab any non-newline characters that may be left. Changing the loop to
while (<DATA>) {
    next unless /$runpat/;
    print "line $.: \$1=[$1]; \$2=[$2]\n";
}
gives the following output for line 15.
line 15: $1=[cred=username/password]; $2=[ \]
As a complete program, that becomes
#! /usr/bin/env perl
use strict;
use warnings;
# The goofy comment on the next line is a hack to
# help Stack Overflow's syntax highlighter recover
# from its confusion after seeing the quotes. It's
# for presentation only: you won't need it in your
# real code.
my $runpat = qr{^\s*run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)$}; # '
while (<DATA>) {
    next unless /$runpat/;
    print "line $.: \$1=[$1]\n";
}
__DATA__
# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
Output:
line 2: $1=[script.bi]
line 5: $1=[script.bi]
line 8: $1=[script.bi]
line 11: $1=[script]
Conciseness isn't always helpful with regexes. Consider the following alternative but equivalent specification:
my $runpat = qr{
    ^ \s*
    (?:
        run \s+ cred=(?:[^\s']*|'.*?') \s+ (?<script> [^\s\\]+)  # ' hiliter
      | run \s+ (?!cred=) (?<script> [^\s\\]+)
    )
    \s* $
}x;
Yes, it takes more room to write, but it's clearer about acceptable alternatives. Your loop is nearly the same
while (<DATA>) {
    next unless /$runpat/;
    print "line $.: script=[$+{script}]\n";
}
and even spares the poor reader from having to count parentheses.
To use named capture buffers, e.g., (?<script>...), be sure to add
use 5.10.0;
to the top of your program to provide executable documentation of the minimum required version of perl.
Are there sometimes arguments to the script? If not, why not:
/^run(?:\s.*\s|\s)(\S+)\s*$/
I guess that doesn't work on the line continuation bit.
/^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/
Test program:
#!/usr/bin/perl
use strict;
use warnings;

my $foo = "# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \\
script.bi
";
foreach my $line (split(/\n/, $foo)) {
    print "Looking >$line<\n";
    print "Match >$1<\n"
        if $line =~ /^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/;
}
Example output:
Looking ># normal way<
Looking >run cred=username/password script.bi<
Match >script.bi<
Looking ><
Looking ># single quoted username password, also separated in a different way<
Looking >run cred='username password' script.bi<
Match >script.bi<
Looking ><
Looking ># username/password is optional<
Looking >run script.bi<
Match >script.bi<
Looking ><
Looking ># script extension is optional<
Looking >run script<
Match >script<
Looking ><
Looking ># the call might be broken into multiple lines using <
Looking ># THIS ONE SHOULD NOT MATCH<
Looking >run cred=username/password \<
Looking >script.bi<