Perl Regex to match everything after # character

I have a bunch of text files that contain tags referenced by the # symbol. For example, a note is tagged 'home' if it contains #home.
I am trying to find a Perl regex that will match everything after the # character, but not the # character itself.
I have this so far: (#\w+), which successfully matches the whole tag (e.g. it matches #home, #work, etc.), but I can't find a way to modify it so that only the characters after the # get picked up.
I had a look at "perl regex to match all words following a character" but I couldn't work it out from that.
Any help would be great.

As @Quentin said, #(\w+) is the best solution.
#!/usr/bin/perl
while (<>) {
    while (/#(\w+)/g) {
        print $1, "\n";
    }
}
If you want the match itself (rather than a capture group) to be exactly the tag name, you can try (?<=#)\w+ instead. The lookbehind requires a # immediately before the match, but the # is not part of the match.
#!/usr/bin/perl
while (<>) {
    while (/(?<=#)\w+/g) {
        print $&, "\n";
    }
}
Reference: Using Look-ahead and Look-behind

Just move the # so it is outside the capturing group:
#(\w+)
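As a minimal sketch of the two approaches above, the following collects tag names from a string instead of reading files. The helper names tags_by_capture and tags_by_lookbehind and the sample note are made up for illustration; both rely on a /g match in list context returning all matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return the tag names in $text without the leading '#'.
sub tags_by_capture {
    my ($text) = @_;
    return $text =~ /#(\w+)/g;      # list context: returns every $1
}

# Same result using a lookbehind, so the whole match is the tag name.
sub tags_by_lookbehind {
    my ($text) = @_;
    return $text =~ /(?<=#)\w+/g;   # no groups: returns every full match
}

# Made-up sample note
my $note = "Buy paint #home, then email the report #work";
print join(", ", tags_by_capture($note)), "\n";
print join(", ", tags_by_lookbehind($note)), "\n";
```

Both calls should print the same list, which is why the capture-group version is usually enough.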

Related

Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recursively match the last captured group
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.
$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups.
The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.
As a consequence, the text after the code-block will be stored in capture $3.
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
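The two suggestions can be combined. The sketch below is one possible arrangement, not the asker's exact code: the definitions live in their own fragment ($defs, a name chosen here) that is interpolated at the end of the pattern, and the surrounding text is read through named captures. The helper around_block and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Definitions only: this fragment matches the empty string and just
# declares the named subpattern "block" (balanced braces).
my $defs = qr/
    (?(DEFINE)
        (?<block> \{ (?: [^{}]++ | (?&block) )* \} )
    )
/x;

# Hypothetical helper: return the text before and after the first
# balanced {...} block, or an empty list if there is no such match.
sub around_block {
    my ($text) = @_;
    if ($text =~ / (?<before> .+? ) (?&block) (?<after> .+ ) $defs /xs) {
        return ($+{before}, $+{after});
    }
    return;
}

# Made-up sample input
my $code = "int f(int x)\n{\n  if (x) { return x; }\n}\n// more stuff\n";
my ($before, $after) = around_block($code);
print "before: $before";
print "after: $after";
```

Because the captures are looked up by name, it no longer matters which number the block group ends up with.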
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
In list context, extract_bracketed returns the matched substring, the remainder of the string after the match, and the substring before the match. We can then keep matching in the remainder.
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
    ($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
    say $lead // $text;
    last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.
I've used a made-up string above, but I tested it on your code sample as well.
This can also be done using Regexp::Common.
Break the string using its $RE{balanced} regex, then take odd elements
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter the first element is an empty string, as usual with split.
To clean out leading and trailing spaces pass it through map { s/^\s+|\s+$//gr }.
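The trimming step can be tried on its own. This is a small sketch with a made-up helper name (trim_all) and made-up input; it assumes Perl 5.14 or later, since the /r modifier returns the modified copy instead of changing the original.

```perl
use strict;
use warnings;

# Hypothetical helper: return trimmed copies of the given strings.
# The /r modifier leaves the originals untouched.
sub trim_all {
    return map { s/^\s+|\s+$//gr } @_;
}

# Made-up sample pieces, such as split might produce
my @parts = ("  start ", "\n and\t", ", and done");
print "[$_]\n" for trim_all(@parts);
```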
You're very close.
(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!
You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)
This part defines your variables
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
This part calls your variables into use.
(?&block)
Edits
Edit 1
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)
The regex above will get the comment after a block (as you've already defined).
You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)
Edit 2
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.
Edit 3
As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.
Edit 4
I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it; therefore, the only true solution is to capture it. There is, however, a workaround that works for your situation. If you want to grab the first and last sections without the second section, you can use the following regex, which will not check the second section for proper syntax (the downside). If you do need to check the syntax, you're going to have to deal with there being an additional capture group.
(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })

What regex should be used to avoid symbols before text in perl?

I have the following variable. I only want to print "yes" if the variable contains "imoport/canada/campingplaces/tobermory" on its own, not preceded by # or anything else. What should I put in a regex for this kind of thing?
my $textfile = "# imoport/canada/campingplaces/tobermory
imoport/canada/campingplaces/tobermory
#imoport/canada/campingplaces/tobermory";
my $textNeeded= "imoport/canada/campingplaces/tobermory";
This is what I am using:
if ($textfile =~ m/$textNeeded/i) {
    print "yes working"
}
Note: I am getting data from different text files, so some text files might contain only "#imoport/canada/campingplaces/tobermory". I want to avoid those.
Despite the quite vague problem description, I think I have puzzled out what you mean. You mean you may have lines where the text is commented out with #, and you want to avoid matching those.
print "yes" if $textfile =~ /^\s*$textNeeded/im;
This will match any line inside $textfile that begins with optional whitespace followed by your string. The /m option makes the regex multiline, meaning that ^ and $ also match at the start and end of each line inside a larger string, not only at the start and end of the whole string.
You may wish to be wary of regex meta characters in your search string. If for example your search string is foo[bar].txt, those brackets will be interpreted as a character class instead. In which case you would use
/^\s*\Q$textNeeded\E/im
instead. The \Q ... \E will make the text inside match only literal characters.
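A minimal sketch of that check wrapped in a function follows. The helper name has_uncommented_line and the second test string are assumptions for illustration; the search string is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $needle starts some line of $text,
# allowing only whitespace before it (so "#..." lines do not count),
# and 0 otherwise.
sub has_uncommented_line {
    my ($text, $needle) = @_;
    return $text =~ /^\s*\Q$needle\E/im ? 1 : 0;
}

my $wanted         = "imoport/canada/campingplaces/tobermory";
my $mixed          = "# $wanted\n$wanted\n#$wanted";
my $only_commented = "# $wanted\n#$wanted";

print has_uncommented_line($mixed, $wanted), "\n";           # 1
print has_uncommented_line($only_commented, $wanted), "\n";  # 0
```

Because of \Q ... \E, a search string such as foo[bar].txt is matched literally rather than as a character class.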
I think you need an anchor to say that you want a match only if your target string appears at the beginning of a line. This uses the caret (^) symbol:
if ($textfile =~ m/^$textNeeded/im) {
    print "yes working"
}
This won't report a match if there are spaces or tabs before your $textNeeded string. The /m modifier is needed so that ^ matches at the start of every line in $textfile, not only at the start of the whole string.
To simply return the rows having no leading hash, something like this:
my $textfile = "# imoport/canada/campingplaces/tobermory
imoport/canada/campingplaces/tobermory
#imoport/canada/campingplaces/tobermory";
for (split /^/, $textfile) {
    print $_ if(m/^\s*[a-zA-Z].*/);
}
Returns:
imoport/canada/campingplaces/tobermory

Perl Regex negation for multiple words

I need to exclude some URLs for a jMeter test:
dont exclude:
http://foo/bar/is/valid/with/this
http://foo/bar/is/also/valid/with/that
exclude:
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/?=action
http://foo/bar/is/not/valid/with/specialword
Please help me?
My following regex isn't working:
foo/(\?=|\?action|\?form_action|specialword).*
First problem: / is the usual regex delimiter, so either escape it as \/ or use a different delimiter.
Second problem: the pattern only matches when ?=, ?action and so on come directly after foo/. You need to include a wildcard before the group: foo\/.*(\?=|\?action|\?form_action|specialword).*
So:
/foo\/.*(\?=|\?action|\?form_action|specialword).*/
The next problem is that this matches the opposite of what you want: your excludes. You can either fine-tune your regex to do the inverse, or you can handle this in your language (i.e. if there is no match, do this and that).
Always pay attention to special characters in regex. See here also.
There are countless ways to shoot yourself in the foot with regular expressions. You could write some kind of "parser" using /g and /c in a loop, but why bother? It seems like you are already having trouble with the current regular expression.
Break the problem down into smaller parts and everything will be less complicated. You could write yourself some kind of filter for grep like:
use URI;
use feature 'say';
sub filter {
    my $u = shift;
    my $uri = URI->new($u);
    return undef if $uri->query;
    return undef if grep { $_ eq 'specialword' } $uri->path_segments;
    return $u;
}
say for grep { filter $_ } @urls;
I wouldn't cling that hard to a regular expression, especially if others have to read the code too...
Change the regex delimiter to something other than '/' so you don't have to escape it in your matches. You might do:
m{//foo/.+(?:\?=action|\?form_action|specialword)$};
The ?: denotes grouping-only.
Using this, you could say:
print unless m{//foo/.+(?:\?=action|\?form_action|specialword)$};
Your alternation is wrong. foo/(\?=|\?action|\?form_action|specialword) matches any of
foo/?=
foo/?action
foo/?form_action
foo/specialword
so you need instead
m{foo/.*(?:\?=action|\?=form_action|specialword)}
The .* is necessary to account for the possible bar/is/valid/with/this after /foo/.
Note that I have changed your ( .. ) to the non-capturing (?: .. ) and I have used braces for the regex delimiter to avoid having to escape the slashes in the expression.
Finally, you need to write either
unless ($url =~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
or
if ($url !~ m{/foo/.*(?:\?=action|\?=form_action|specialword)}) { ... }
since the regex matches URLs that are to be discarded.
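As a sketch of how the negated match could be used on a list, the following filters a set of URLs with grep. The helper allowed_urls and the variable $excluded are made-up names, and the ?=form_action URL is an assumed example added to exercise that alternative; the other URLs come from the question.

```perl
use strict;
use warnings;

# Pattern from the answer above: matches URLs that should be discarded.
my $excluded = qr{/foo/.*(?:\?=action|\?=form_action|specialword)};

# Hypothetical helper: keep only the URLs that do NOT match $excluded.
sub allowed_urls {
    return grep { $_ !~ $excluded } @_;
}

my @urls = qw(
    http://foo/bar/is/valid/with/this
    http://foo/bar/is/also/valid/with/that
    http://foo/bar/is/not/valid/with/?=action
    http://foo/bar/is/not/valid/with/?=form_action
    http://foo/bar/is/not/valid/with/specialword
);
print "$_\n" for allowed_urls(@urls);
```

Only the first two URLs should be printed.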

Remove characters and numbers from a string in perl

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores and capitalize the first letter:
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.
So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*
Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one is probably too sed-like for its taste! (Of course, both versions work just the same.)
Explanation (as requested by stema):
$filename =~ s/
^ # matches the start of the line
_\d+- # matches an underscore, one or more digits and a hyphen-minus
(.*?)-- # matches (non-greedily) anything before two consecutive hyphen-minuses
# and captures the entire match (as the first capture group)
(.*?)_ # matches (non-greedily) anything before a single underscore and
# captures the entire match (as the second capture group)
(.*?)_ # does the same as the one before (but captures the match as the
# third capture group obviously)
$ # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u$1, \u$2 and \u$3 in the replacement specification simply tell Perl to insert capture groups 1 to 3 with their first character made upper-case. If you wanted to make an entire captured group upper-case, you would have to use \U instead.
The x flag turns on verbose mode, which tells the Perl interpreter that we want to use # comments, so it will ignore these (and any white space in the regular expression, so if you want to match a space you have to use either \s or "\ ", a backslash followed by a space). Unfortunately I couldn't figure out how to tell Perl to ignore white space in the replacement specification, which is why I've written that on a single line.
(Also note that I've changed my s terminator from , to / because Perl barked at me if I used the , with verbose mode turned on. I'm not exactly sure why, but most likely it is because the comments themselves contain commas, which would end the pattern early.)
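For reuse in a rename loop, the substitution can be wrapped in a small function. This is only a sketch: the name pretty_name is made up, and it assumes every filename follows the _NN-author--book_revision_ pattern from the question, returning anything else unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "_00-author--book_revision_" into
# "Author - Book (Revision)"; names that do not fit are returned unchanged.
sub pretty_name {
    my ($name) = @_;
    (my $new = $name) =~ s/^_\d+-(.*?)--(.*?)_(.*?)_$/\u$1 - \u$2 (\u$3)/;
    return $new;
}

print pretty_name("_00-author--book_revision_"), "\n";  # Author - Book (Revision)
```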
If they all follow that format then try:
my ($author, $book, $revision) = $newfile =~ /-(.*?)--(.*?)_(.*?)_/;
print ucfirst($author) . " - " . ucfirst($book) . " (" . ucfirst($revision) . ")\n";

Search html file for random string using regex

I am trying to use Perl to search through an html file, looking for a semi-random string and store the match in a variable or print it out.
The string is the name of a jpg image and always follows the pattern of 9 digits followed by 6 lower case letters, i.e.
140005917smpxgj.jpg
But it is random every time. I am sure Perl can do this, but I will admit I am getting a bit confused.
Not too complicated. You may want to watch out for varying caps in the extension, e.g. JPG. If that is a concern, you may add (?i) before the extension.
You may also wish to prevent partial names, e.g. discard a match that has more than 9 digits. That is the (?<!\d) part: Make sure no digit characters precede the match.
ETA: Now extracts multiple matches too, thanks to ikegami.
use feature 'say';
my @match;
while (<>) {
    for (/(?<!\d)([0-9]{9}[a-z]{6}\.(?i)jpg)/g) {
        say;
        push @match, $_;
    }
}
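The same pattern can be tried on a string instead of standard input. In this sketch the helper name image_names and the sample HTML are made up, and the case-insensitive part is written as the scoped group (?i:jpg) rather than the inline (?i) used above; the two should behave the same here.

```perl
use strict;
use warnings;

# Hypothetical helper: return every image name of the form
# 9 digits + 6 lowercase letters + ".jpg" (extension in any case).
sub image_names {
    my ($html) = @_;
    return $html =~ /(?<!\d)([0-9]{9}[a-z]{6}\.(?i:jpg))/g;
}

# Made-up sample: the middle name has 10 digits and should be skipped.
my $html = '<img src="140005917smpxgj.jpg"> '
         . '<img src="1234567890abcdef.jpg"> '
         . '<img src="777666999abcdef.JPG">';
print "$_\n" for image_names($html);
```

The lookbehind (?<!\d) is what rejects the ten-digit name, since any nine-digit run inside it is preceded by another digit or is followed by a digit instead of a letter.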
Try this regex:
/\b\d{9}[a-z]{6}\.jpg/
perldoc perlre
use warnings;
use strict;
while (<DATA>) {
    if (/ ( [0-9]{9} [a-z]{6} [.] jpg ) /x) {
        print "$1\n";
    }
}
__DATA__
foo 140005917smpxgj.jpg bar
sdfads 777666999abcdef.jpg dfgffgh
Prints:
140005917smpxgj.jpg
777666999abcdef.jpg
The solution regex is \d{9}[a-z]{6}\.jpg