Remove characters and numbers from a string in perl - regex

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores and capitalize the first letter
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.

So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*

Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
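The second set of substitutions can be checked against the sample name from the question. This is only a minimal sketch; the sub name tidy_name is made up for illustration and is not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the "according to your example" substitutions.
sub tidy_name {
    my ($name) = @_;
    for ($name) {                  # alias $_ to $name so the bare s/// calls apply to it
        s/^[^\pL]+//;              # drop everything before the first letter
        s/--/ - /;                 # the first "--" becomes " - "
        s/_/ (/;                   # the first remaining "_" opens the parenthesis
        s/_/)/;                    # the next "_" closes it
        s/(?<!\pL)(\pL)/\U$1/g;    # uppercase each letter not preceded by a letter
    }
    return $name;
}

print tidy_name('_00-author--book_revision_'), "\n";   # Author - Book (Revision)
```

Note that the last substitution capitalizes every word, so a multi-word author or title would be capitalized word by word.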

$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one is probably too sed-like for its taste! (Of course, both versions work just the same.)
Explanation (as requested by stema):
$filename =~ s/
^ # matches the start of the line
_\d+- # matches an underscore, one or more digits and a hyphen-minus
(.*?)-- # matches (non-greedily) anything before two consecutive hyphen-minus
# and captures the entire match (as the first capture group)
(.*?)_ # matches (non-greedyly) anything before a single underscore and
# captures the entire match (as the second capture group)
(.*?)_ # does the same as the one before (but captures the match as the
# third capture group obviously)
$ # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u before each of $1, $2 and $3 in the replacement tells Perl to insert that capture group with its first character upper-cased. If you wanted to upper-case an entire captured group, you would have to use \U instead.
The x flag turns on verbose mode, which tells the Perl interpreter that we want to use # comments, so it will ignore these (and any white space in the regular expression, so if you want to match a space you have to use either \s or an escaped space, "\ "). Unfortunately I couldn't figure out how to tell Perl to ignore white space in the replacement specification; this is why I've written that on a single line.
(Also note that I've changed my s terminator from , to / - Perl barked at me if I used the , with verbose mode turned on ... not exactly sure why.)
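One way to spread the replacement over several lines is the e flag, which makes Perl treat the replacement as code instead of a string. The sketch below assumes the same filename layout as the question; the sub name pretty_name is hypothetical. As for the s,,, form failing in verbose mode: Perl finds the closing delimiter before it strips comments, so the comma inside the first comment most likely ended the pattern early. That is a guess from the text of the comments, not something checked against the original run.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is the one explained above.
sub pretty_name {
    my ($filename) = @_;
    $filename =~ s{
        ^
        _ \d+ -      # underscore, digits, hyphen-minus
        (.*?) --     # author (group 1)
        (.*?) _      # book (group 2)
        (.*?) _      # revision (group 3)
        $
    }{
        ucfirst($1) . ' - ' . ucfirst($2) . ' (' . ucfirst($3) . ')'
    }xe;
    return $filename;
}

print pretty_name('_00-author--book_revision_'), "\n";   # Author - Book (Revision)
```

Using braces as delimiters also avoids clashes with commas or slashes in the comments.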

If they all follow that format then try:
my ($author, $book, $revision) = $newfile =~ /-(.*?)--(.*?)_(.*?)_/;
print ucfirst($author) . " - $book ($revision)\n";

Related

Perl regex matching alternative file names

Example:
I used to use regex to get extension from the file name:
my $name = "file.zip";
my ($fname, $fext) = $name =~ /(.*)\.(.*)/;
# file
# zip
Now, I need to make sure that it also properly catches .tar.gz files, in case name includes it, otherwise falls back to example above. I did the following:
my $name = "file.tar.gz";
my ($fname, $fext) = $name =~ /(.*)\.(tar\.gz$)|(.*)\.(.*)/;
# file
# tar.gz
Problem:
The problem is that now it only works for file.tar.gz and doesn't fall back to catching regular files, like file.zip, and returns empty, in the second case.
How do I do this in one regex, so it successfully works for file.tar.gz and file.zip. What do I miss?
You may use
/^(.*?)\.(tar\.gz|[^.]*)$/
Details
^ - start of a line
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible
\. - a dot
(tar\.gz|[^.]*) - Group 2: tar.gz or any 0+ chars other than a dot
$ - end of line.
See the regex demo.
Alternatively, you might also use your original pattern but wrap it with a branch reset group:
/(?|(.*)\.(tar\.gz)|(.*)\.(.*))$/
See this regex demo. It will assign the same IDs to the corresponding capturing groups inside the branch reset group. Since (.*)\.(tar\.gz) will be tried first, if there is a string ending with .tar.gz, the first alternation part ((.*)\.(tar\.gz)) will match, else, the second one ((.*)\.(.*)) will consume the string.
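Both patterns can be tried side by side. A small sketch, assuming Perl 5.10 or later for the branch reset group; the sub names split_ext and split_ext_reset are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, extension), or an empty list if there is no dot.
sub split_ext {
    my ($name) = @_;
    return $name =~ /^(.*?)\.(tar\.gz|[^.]*)$/;
}

# Same idea with the branch reset group, so both alternatives fill $1 and $2.
sub split_ext_reset {
    my ($name) = @_;
    return $name =~ /(?|(.*)\.(tar\.gz)|(.*)\.(.*))$/;
}

for my $file ('file.zip', 'file.tar.gz', 'a.b.zip') {
    my ($fname, $fext) = split_ext($file);
    print "$file -> $fname | $fext\n";
}
```

Both subs must be called in list context to get the captured groups back.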
perl -e '$name= "file.zip";($fname,$fext)=$name =~ /(.*)\.(tar\.gz|zip)$/ ;print "$fname.$fext"'
file.zip
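Your original pattern has four capture groups, but the list assignment ($fname,$fext)= has only two variables, so only the first two groups are assigned. When the second alternative matches, as it does for file.zip, those first two groups are undefined.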
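The effect is easy to see by printing all four groups that the original pattern returns for file.zip. A short sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

my @groups = 'file.zip' =~ /(.*)\.(tar\.gz$)|(.*)\.(.*)/;

# Show each group, marking the ones that did not take part in the match.
my @shown = map { defined $_ ? $_ : 'undef' } @groups;
print join(', ', @shown), "\n";   # undef, undef, file, zip
```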

perl regex remove newlines in string

I have a Perl script which runs over a database dump in a plain text file, trying to remove all instances of newlines and possibly other odd characters when I see strings between quotes:
INSERT INTO ... VALUES ( "... these are the lines I'm interested in." )
I slurp in the file:
@file = <FILE>;
and:
foreach my $line (@file) {
$line =~ s/"[^"]*(\R)+[^"]*"//g;
# I want to get rid of newlines in strings
# And other odd characters I might come across
}
One character class I used instead of (\R) was:
([\r\n\t\v\f]+)
and I would try to:
$line =~ s/"[^"]+?([\r\n\t\v\f]+)[^"]*"//g;
I'm sure I'm missing something. I try to start matching with a literal double quote, scan past anything not a double quote (non-greedy, at least one match), reach the characters I want to get rid of, and keep scanning not double quote (any number of other characters not a double quote) until I reach the ending double quote.
So I wanted to replace $1 capture above with nothing.
I've tried on-line regex builders, and
/"[^"]*?([\r\n\t\f\v]+)[^"]*"/
worked with an on-line test, using a short paragraph with newlines and tabs in it, although it was in PHP pcre mode. I thought it would have worked with Perl.
Perhaps I'm not escaping some characters properly in the regex for Perl? Or the pattern is just not going to work the way I want it to, because it's wrong.
Thank you, any help appreciated.
The regex at regex101.com:
"[^"]*?([\r\n\f\t\v]+)[^"]*?"
matches for strings like this:
"This is
my\t test
string.
So there!"
I'm thoroughly puzzled now. :)
The real problem is that you will only find one group of \R's when there could be many groups between quotes. The best thing to do is make a callback (eval) with a general match between quotes, then substitute the \R's in
the replacement.
something like:
sub repl {
my ($content) = @_;
$content =~ s/\R+//g;
return $content;
}
$input =~ s/"([^"]*)"/ '"' . repl($1) . '"' /ge; # put the quotes back around the cleaned text
edit: If you're looking for only 1 linebreak cluster, you have to
exclude linebreaks leading up to it. For example: [^"\r\n]+
edit2: To slurp the file into $input, do a
$/ = undef;
my $input = <$fh>;
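Put together, the callback approach might look like the sketch below. It assumes the quoted strings never contain escaped double quotes and that the whole dump is in one scalar; the wrapper name clean_quoted and the sample data are invented for the example.

```perl
use strict;
use warnings;

# Remove every run of line breaks from the text between the quotes.
sub repl {
    my ($content) = @_;
    $content =~ s/\R+//g;
    return $content;
}

# Hypothetical wrapper; keeps the surrounding quotes in place.
sub clean_quoted {
    my ($input) = @_;
    $input =~ s/"([^"]*)"/ '"' . repl($1) . '"' /ge;
    return $input;
}

my $dump = qq{INSERT INTO t VALUES ( "This is\nmy test\r\nstring." );\n};
print clean_quoted($dump);
```

Line breaks outside the quotes are left alone. Because the breaks are deleted outright, the words on either side are joined; substituting a space instead may suit some data better.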

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing and learning Perl so that I can read log files. I want to search every line and look for a string of alphanumeric followed by this ; at the beginning of each line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use the ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
\A # absolute start of string
(?:$alphanum)+ # I can change this elsewhere
;
/x;
if( $line =~ $regex ) { ... }
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.
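A quick way to check the anchoring is to run a few sample lines through a small helper. This sketch uses \A and a capture group so the token can be returned; the sub name leading_token and the sample lines are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the leading alphanumeric token, or undef.
sub leading_token {
    my ($line) = @_;
    return $line =~ /\A([a-z0-9]+);/i ? $1 : undef;
}

for my $line ('abc123;rest of line', ' abc;indented', 'foo bar;baz') {
    my $token = leading_token($line);
    printf "%-22s => %s\n", "[$line]", defined $token ? $token : 'no match';
}
```

Only the first sample matches: the second starts with a space, and in the third the semicolon does not directly follow the first token.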

Regular expression to get executed script names from command history

I am trying to write a regex that will parse the syntax for calling a script and capture the script name.
All of these are valid syntax for the call
# normal way
run cred=username/password script.bi
# single quoted username password, also separated in a different way
run cred='username password' script.bi
# username/password is optional
run script.bi
# script extension is optional
run script
# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
This is what I have so far
my $r = q{run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\]+)};
to capture the values in $1.
But I get a
Unmatched [ before HERE mark in regex m/run +(?:cred=(?:[^\s\']*|\'.*\') +)?([ << HERE ^\s\]+)/
The \\ is being treated as a \ and hence in the regex it becomes \] so escaping ] and hence the unmatched [
Replace with run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\\\]+) ( note the \\\\ ) and try.
Also, as the comments point out, you should be using qr for the regex rather than just q.
( I had just looked at the error, not the validity / efficiency of the regex for your problem)
The essence of your problem with specifying the regex is a difference of one byte: q versus qr. You're writing a regex, so call it what it is. Treating the pattern as a string means you have to deal with the rules for string quoting on top of the rules for regex escaping.
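The difference can be seen with just the character class from the question. A minimal sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

# In a q{} string, "\\" collapses to a single backslash before the regex engine sees it.
my $as_string = q{[^\s\\]+};
print "string form: $as_string\n";            # string form: [^\s\]+

# Compiling that string as a pattern fails with "Unmatched [".
my $compiled = eval { qr/$as_string/ };
print "string form compiles: ", (defined $compiled ? "yes" : "no"), "\n";

# qr{} applies regex rules only, so "\\" stays an escaped backslash.
my $as_regex = qr{[^\s\\]+};
my ($word) = 'script.bi\\' =~ /^($as_regex)/;
print "qr form matched: $word\n";             # qr form matched: script.bi
```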
As for the language that your regex matches, add anchors to force the pattern to match the entire line. The regex engine is fiercely determined and will keep working until it finds a match. Without anchors, it's happy to find a substring.
Sometimes this gives you surprising results. Have you ever dealt with a petulant child (or a childish adult) who takes a narrow, exceedingly literal interpretation of what you say? The regex engine is that way, but it's trying to help.
With the last example it matches because
You said with the ? quantifier that the cred=... subpattern can match zero times, so the regex engine skipped it.
You said the script name is the following substring that's a run of one or more non-whitespace, non-backslash characters, so the regex engine saw cred=username/password, none of which are whitespace or backslash characters, and matched. Regexes are greedy: they consider what's right in front of them without regard to whether a given substring “should have” been matched by another subpattern.
The last example fits the bill—although not in the way that you intended. An important lesson with regexes is any quantifier such as ? or * that can match zero times always succeeds!
Without the $ anchor, the pattern from your question leaves the trailing backslash unmatched, which you can see with a slight modification to $runpat.
qr{run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)(.*)}; # ' SO hiliter hack
Notice the (.*) at the end to grab any non-newline characters that may be left. Changing the loop to
while (<DATA>) {
next unless /$runpat/;
print "line $.: \$1=[$1]; \$2=[$2]\n";
}
gives the following output for line 15.
line 15: $1=[cred=username/password]; $2=[ \]
As a complete program, that becomes
#! /usr/bin/env perl
use strict;
use warnings;
# The goofy comment on the next line is a hack to
# help Stack Overflow's syntax highlighter recover
# from its confusion after seeing the quotes. It's
# for presentation only: you won't need it in your
# real code.
my $runpat = qr{^\s*run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)$}; # '
while (<DATA>) {
next unless /$runpat/;
print "line $.: \$1=[$1]\n";
}
__DATA__
# normal way
run cred=username/password script.bi
# single quoted username password, also separated in a different way
run cred='username password' script.bi
# username/password is optional
run script.bi
# script extension is optional
run script
# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
Output:
line 2: $1=[script.bi]
line 5: $1=[script.bi]
line 8: $1=[script.bi]
line 11: $1=[script]
Conciseness isn't always helpful with regexes. Consider the following alternative but equivalent specification:
my $runpat = qr{
^ \s*
(?:
run \s+ cred=(?:[^\s']*|'.*?') \s+ (?<script> [^\s\\]+) # ' hiliter
| run \s+ (?!cred=) (?<script> [^\s\\]+)
)
\s* $
}x;
Yes, it takes more room to write, but it's clearer about acceptable alternatives. Your loop is nearly the same
while (<DATA>) {
next unless /$runpat/;
print "line $.: script=[$+{script}]\n";
}
and even spares the poor reader from having to count parentheses.
To use named capture buffers, e.g., (?<script>...), be sure to add
use 5.10.0;
to the top of your program to provide executable documentation of the minimum required version of perl.
Are there sometimes arguments to the script? If not, why not:
/^run(?:\s.*\s|\s)(\S+)\s*$/
I guess that doesn't work on the line continuation bit.
/^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/
Test program:
#!/usr/bin/perl
$foo="# normal way
run cred=username/password script.bi
# single quoted username password, also separated in a different way
run cred='username password' script.bi
# username/password is optional
run script.bi
# script extension is optional
run script
# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \\
script.bi
";
foreach my $line (split(/\n/,$foo))
{
print "Looking >$line<\n";
print "Match >$1<\n"
if ($line =~ /^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/);
}
Example output:
Looking ># normal way<
Looking >run cred=username/password script.bi<
Match >script.bi<
Looking ><
Looking ># single quoted username password, also separated in a different way<
Looking >run cred='username password' script.bi<
Match >script.bi<
Looking ><
Looking ># username/password is optional<
Looking >run script.bi<
Match >script.bi<
Looking ><
Looking ># script extension is optional<
Looking >run script<
Match >script<
Looking ><
Looking ># the call might be broken into multiple lines using <
Looking ># THIS ONE SHOULD NOT MATCH<
Looking >run cred=username/password \<
Looking >script.bi<

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
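The same behaviour is easy to reproduce in Perl, which is used here only for illustration (the question itself is about Eclipse's find/replace dialog). The sample block is made up.

```perl
use strict;
use warnings;

my $block = "# one\n# two\n# three\n";

# The group is repeated three times, but $1 holds only the final repetition.
if ($block =~ /\A(?:#(.*)\R)+\z/) {
    print "captured: [$1]\n";    # captured: [ three]
}
```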
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
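In Perl, the one-line-at-a-time replacement could be sketched as follows. This is a variant of the pattern above: it uses [ \t]* instead of \s* so that an empty comment line does not swallow its own newline, and it relies on the m flag so that ^ matches at every line start. The sub name uncomment is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strips one leading "#" and the blanks after it from every line.
sub uncomment {
    my ($text) = @_;
    $text =~ s/^#[ \t]*//mg;
    return $text;
}

print uncomment(qq{# print "yes";\n# print "no";\n# print "blah";\n});
```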
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg; # /m lets ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg; # strip the leading # of every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]*)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]*) and matches any leading whitespace (spaces and tabs), including none.
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
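The same idea in Perl might look like the sketch below. It folds the # into the pattern without a group of its own and adds a negative lookahead so that a #! line is left alone; both changes, and the sub name uncomment_indented, are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the first "#" on each comment line, keeping indentation.
sub uncomment_indented {
    my ($text) = @_;
    $text =~ s/^([ \t]*)#(?!!)(.*)$/$1$2/mg;
    return $text;
}

print uncomment_indented("#!/usr/bin/perl\n\t# print 1;\n#print 2;\n");
```

Only the # is removed, so a space that followed it stays in the line.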