Perl regular expression letter.number - regex

I have strings:
xxxx.log, log.1, log.2, blog, photolog
So I would like to match only the strings named
[log.num] e.g log.1 log.12 log.3
[.log] e.g xxxx.log
but not the strings named blog or photolog.
Any help would be appreciated.
So far this is my code, but it also matches blog and other words that contain "log", which is not what I want.
while( <> ) {
printf "%s",$_ if /log/;
}

As explained, \blog\b may work for your data. However, if log can appear in the string next to other non-word characters (like log-a.txt), that gets matched too, so you need to be more specific. One way is to match precisely what is expected. Going by the shown sample:
while (<>) {
print if /(?: \.log | log\.\d+ )$/x;
}
where (?: ) is the non capturing group, used so that either of the alternation patterns is anchored to the end of the string, $ (but not needlessly captured). Otherwise we'd match x.log.OLD or such. With the /x modifier spaces may be used without being matched, good for readability.
The patterns in alternation | can be combined but that gets far more complicated.
The printf %s, $var has no advantage over print $var (unless the format is more involved).
A one-liner test
perl -wE'
@ary = qw(log-out xxxx.log log.1 log.2 blog photolog);
/(?:\.log|log\.\d+)$/ && say for @ary
'
where feature say is used for the newline, needed here. In a one-liner -E (capital) enables it.

This should do the trick:
while( <> ) {
printf "%s",$_ if /\blog\b/;
}
The key is the \b, which is a word boundary. You can read about them here: http://www.regular-expressions.info/wordboundaries.html but in short it matches at a position between a word character (letter, digit, or underscore) and a non-word character, or at the beginning or end of the string next to a word character. Another option would be:
/\.log\d/
That would match .log3. The period has to be escaped and the \d means a digit. That being said, I think \b is what you want.
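To see how the word-boundary pattern and the end-anchored alternation differ in practice, here is a small sketch that runs both over the sample names from the question plus the two edge cases mentioned in the answers (log-out and x.log.OLD). The helper name matching is made up for this illustration.

```perl
use strict;
use warnings;

# Sample names from the question, plus two edge cases from the answers.
my @names = qw(xxxx.log log.1 log.12 blog photolog log-out x.log.OLD);

# Hypothetical helper: return the names that match a compiled regex.
sub matching {
    my ($re, @list) = @_;
    return grep { $_ =~ $re } @list;
}

my @by_boundary = matching(qr/\blog\b/, @names);
my @by_anchor   = matching(qr/(?:\.log|log\.\d+)$/, @names);

print "word boundary: @by_boundary\n";
print "anchored:      @by_anchor\n";
```

Both patterns reject blog and photolog. The word-boundary pattern also accepts log-out and x.log.OLD, while the anchored one keeps only xxxx.log, log.1 and log.12.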

Related

Capturing text before and after a C-style code block with a Perl regular expression

I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
// more stuff to capture
END
# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
(?<block>
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?&block) # Recurse into the named group "block"
)* # Match 0 or more times
\} # Match closing brace
)
)/x;
# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
print $1;
print $2;
}
I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.
$2 should contain the comment below the block of code but it doesn't and I can't find a good reason why this isn't working.
Capture groups are numbered left-to-right in the order they occur in the regex, not in the order they are matched. Here is a simplified view of your regex:
m/
(.+?) # group 1
(?: # the $code_block regex
(?&block)
(?(DEFINE)
(?<block> ... ) # group 2
)
)
(.+) # group 3
/xs
Named groups can also be accessed as numbered groups.
The 2nd group is the block group. However, this group is only used as a named subpattern, not as a capture. As such, the $2 capture value is undef.
As a consequence, the text after the code-block will be stored in capture $3.
There are two ways to deal with this problem:
For complex regexes, only use named capture. Consider a regex to be complex as soon as you assemble it from regex objects, or if captures are conditional. Here:
if ($text =~ m/(?<before>.+?)$code_block(?<afterwards>.+)/s){
print $+{before};
print $+{afterwards};
}
Put all your defines at the end, where they can't mess up your capture numbering. For example, your $code_block regex would only define a named pattern which you then invoke explicitly.
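The second option can be sketched roughly as follows. The names $block_def and around_block are made up for this illustration, and the sample text is a shortened stand-in for the code in the question. The definition is interpolated at the very end of the pattern, so the two ordinary captures keep the numbers 1 and 2.

```perl
use strict;
use warnings;

# Only defines the named pattern "block"; it consumes nothing by itself.
my $block_def = qr/
    (?(DEFINE)
        (?<block> \{ (?: [^{}]++ | (?&block) )* \} )
    )
/x;

# Hypothetical helper: return the text before and after the first brace block.
sub around_block {
    my ($text) = @_;
    # (?&block) is invoked between the two captures; the definition comes
    # last, so it cannot shift the numbering of the captures before it.
    return $text =~ m/(.+?)(?&block)(.+)$block_def/s ? ($1, $2) : ();
}

my ($before, $after) =
    around_block("int f()\n{\n if (x) { return 1; }\n}\n// tail\n");
print "before: $before";
print "after: $after";
```

Here $before holds the function header and $after holds the trailing comment, without any need for $3.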
There are also ready tools that can be leveraged for this, in a few lines of code.
Perhaps the first module to look at is the core Text::Balanced.
The extract_bracketed in list context returns: matched substring, remainder of the string after the match, and the substring before the match. Then we can keep matching in the remainder
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw/extract_bracketed/;
my $text = 'start {some {stuff} one} and {more {of it} two}, and done';
my ($match, $lead);
while (1) {
($match, $text, $lead) = extract_bracketed($text, '{', '[^{]*');
say $lead // $text;
last if not defined $match;
}
which prints
start
and
, and done
Once there is no match we need to print the remainder, thus $lead // $text (as there can be no $lead either). The code uses $text directly and modifies it, down to the last remainder; if you'd like to keep the original text save it away first.
I've used a made-up string above, but I tested it on your code sample as well.
This can also be done using Regexp::Common.
Break the string using its $RE{balanced} regex, then take odd elements
use Regexp::Common qw(balanced);
my @parts = split /$RE{balanced}{-parens=>'{}'}/, $text;
my @out_of_blocks = @parts[ grep { $_ & 1 } 1..$#parts ];
say for @out_of_blocks;
If the string starts with the delimiter the first element is an empty string, as usual with split.
To clean out leading and trailing spaces pass it through map { s/^\s+|\s+$//gr }.
You're very close.
(?(DEFINE)) will define the expression & parts you want to use but it doesn't actually do anything other than define them. Think of this tag (and everything it envelops) as you defining variables. That's nice and clean, but defining the variables doesn't mean the variables get used!
You want to use the code block after defining it so you need to add the expression after you've declared your variables (like in any programming language)
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)
This part defines your variables
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
This part calls your variables into use.
(?&block)
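A minimal sketch of that difference, assuming a made-up sample string and a hypothetical helper matched_text: a pattern that only contains the DEFINE part succeeds but matches the empty string, while the same pattern followed by (?&block) matches the balanced block.

```perl
use strict;
use warnings;

# Definition only: nothing is invoked, so the pattern matches the empty string.
my $define_only = qr/(?(DEFINE)(?<block>\{(?:[^{}]++|(?&block))*\}))/;

# Definition followed by a call to the named pattern.
my $define_then_call = qr/(?(DEFINE)(?<block>\{(?:[^{}]++|(?&block))*\}))(?&block)/;

# Hypothetical helper: return the matched text, or undef when nothing matches.
sub matched_text {
    my ($re, $string) = @_;
    return $string =~ $re ? $& : undef;
}

my $sample = 'x {a {b} c} y';
printf "define only:     '%s'\n", matched_text($define_only, $sample);
printf "define and call: '%s'\n", matched_text($define_then_call, $sample);
```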
Edits
Edit 1
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:\/\/|\/\*)([\s\S]*?)(?:\r\n|\r|\n|$)
The regex above will get the comment after a block (as you've already defined).
You had a . which will match any character (except newline - unless you use the s modifier which specifies that . should also match newline characters)
Edit 2
(?(DEFINE)
(?<block>\{(?:[^{}]++|(?&block))*\})
)
(?&block)\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex is more syntactically correct for capturing comments. The previous edit will work with /* up until a new line or end of file. This one will work until the closing tag or end of file.
Edit 3
As for your code not working, I'm not exactly sure. You can see your code running here and it seems to be working just fine. I would use one of the regular expressions I've written above instead.
Edit 4
I think I finally understand what you're saying. What you're trying to do is impossible with regex. You cannot reference a group without capturing it; therefore, the only true solution is to capture it. There is, however, a workaround that works for your situation. If you want to grab the first and last sections without the second, you can use the following regex, which has the downside that it will not check the second section for proper syntax. If you do need to check the syntax, you will have to deal with the additional capture group.
(.+?)\{.*\}\s*(?:(?:\/\/([\s\S]*?)(?:\r\n|\r|\n|$))|\/\*([\s\S]*?)\*\/)
This regex captures everything before the { character, then matches everything after it until it meets } followed by any whitespace, and finally by //. This, however, will break if you have a comment within a block of code (after a })

Regex to match absence of substring

I am looking for a regex which only matches when certain sub-strings are not present. In particular - if a line of code does not assign or return the return value from a method.
Examples:
this.execute(); // should match
var x = this.execute(); // no match
return this.execute(); // no match
I was trying to use the following regex
^(?!.*=|return).*execute\(\).*
This works with regex testers etc. - but I am getting "invalid perl operator" exception when using in practice.
Thanks..
Since you want to exclude only assignment or return it's easily negated
while (<DATA>) { print if not /(?:=|return)\s+this\.execute/ }
__DATA__
this.execute();
var x = this.execute();
return this.execute();
This prints only the line this.execute();.
With Lookaround Assertions, a negative lookahead that you offer does work
if (/^(?!.*=|return)\s*this\.execute/x) { print "$_\n" }
As for the negative lookbehind, there is one problem. First, here's what works
if ( /(?<! =\s ) this\.execute/x ) { print "$_\n" }
if ( /(?<! return \s ) this\.execute/x ) { print "$_\n" }
This excludes = or return, with one space. The thing is, we can't put \s+ there nor can we do alternation -- Perl can't do it for this particular assertion, see perlretut. We get
Variable length lookbehind not implemented in regex m/(?<!=\s+)this\.execute/ at
We can add varying space \s+ outside of the assertion, with this..., and then combine multiple conditions to provide for a possibility that there is no space between = and this....
However, there's no reason for this if you can use a regular negated match.
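For completeness, one way to combine the fixed-width lookbehinds with variable whitespace is sketched below. This is only an illustration under the assumption that the lines look like the three examples in the question; the names $bare_call and is_bare_call are hypothetical. The extra (?<!\s) is needed because otherwise the engine could skip the whitespace and start the match directly at this, where the other two lookbehinds would see only a space.

```perl
use strict;
use warnings;

# Three fixed-width negative lookbehinds, then optional whitespace outside
# of them. (?<!\s) forces the match to start before any run of whitespace,
# so the other two assertions inspect what precedes that whitespace.
my $bare_call = qr/(?<! = ) (?<! return ) (?<! \s ) \s* this\.execute/x;

# Hypothetical helper for this illustration.
sub is_bare_call {
    my ($line) = @_;
    return $line =~ $bare_call ? 1 : 0;
}

for my $line ('this.execute();', 'var x = this.execute();',
              'x=this.execute();', 'return this.execute();') {
    printf "%-26s %s\n", $line, is_bare_call($line) ? 'match' : 'no match';
}
```

Only the first line is reported as a match. The plain negated match remains the simpler choice.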
The reported error can only be about basic syntax. It is about the exact code you run, not the regex.
Not so sure if I understand the question, but you might consider trying this one: ^this\.execute\(\);
With situations like these, it's best to find the "lowest common denominator" in the matches you want to distinguish from similar-looking strings. In this case, the var x can be ignored - your requirements are satisfied by saying "anything before the method call is ok - the method call alone is not." That statement is probably a bit too tight though, so let's change it to "anything other than whitespace before the method call is ok, otherwise flag the call". Which means:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
if (/ ^ \s* $method_call /x) {
warn "Found method call on line $.: $1\n"
}
}
I'm presuming $fh is a filehandle to the source code file. I've also made some presumptions, which you may need to tweak, about how you want to define a method call - i.e. the opening bracket for parameters is compulsory. Using 'extended mode' regexes (the /x flag) allows the use of whitespace in the regex for easier reading. Also, using qr// (the 'quote regex' operator) allows referring to a regex by name inside another to make things clearer.
If on the other hand, you want to insist on the presence of var x or return before giving the ok, we can reverse the search - ie explicitly look for the "ok" situations and flag any other calls:
my $method_call = qr/ ( this \. \w+ ) \( /x;
while (<$fh>) {
next if / ^ \s* return \s+ $method_call /x; # return OK
next if / ^ \s* var \s+ \w+ \s* = \s* $method_call /x; # var OK
warn "Found method call on line $.: $1\n" if /$method_call/ ;
}
Both of these are a little verbose but show more clearly what you're trying to do.
I don't think we have enough information here. I say this because the following works for me in the shell
~$ echo "execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
execute()
~$ echo "return execute()"| perl -ne 'print if /^(?!.*=|return).*execute\(\).*/'
~$
In the above code, I am running a one liner in a shell that pipes a string into a perl program. The perl program will print the string if it matches the regex. I get no errors from your regex.
It's possible that the error is due to your version of perl or something else entirely may be happening.
I am using perl v5.22.2
I mean, the simple answer is, just use the ! operator on your test, but here's the conversion in case you were wondering:
/expression/ => /^(?!.*expression)/ (either use DOTALL or [^] in JavaScript)
/^expression/ => /^(?!expression)/
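A short Perl sketch of the first conversion, using the three lines from the question. The choice of "= or return" as the expression is an assumption made for this illustration; /s plays the role of DOTALL.

```perl
use strict;
use warnings;

my @lines = ('this.execute();',
             'var x = this.execute();',
             'return this.execute();');

# The "expression" from the text, here chosen as "= or return".
# Interpolating a qr// object keeps the alternation grouped.
my $expression = qr/=|return/;

# Negating the match result ...
my @negated = grep { !/$expression/ } @lines;

# ... selects the same lines as a negative lookahead anchored at the start.
my @lookahead = grep { /^(?!.*$expression)/s } @lines;

print "negated:   @negated\n";
print "lookahead: @lookahead\n";
```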

Regex to find text between second and third slashes

I would like to capture the text that occurs after the second slash and before the third slash in a string. Example:
/ipaddress/databasename/
I need to capture only the database name. The database name might have letters, numbers, and underscores. Thanks.
How you access it depends on your language, but you'll basically just want a capture group for whatever falls between your second and third "/". Assuming your string is always in the same form as your example, this will be:
/.*/(.*)/
If multiple slashes can exist, but a slash can never exist in the database name, you'd want:
/.*/(.*?)/
/.*?/(.*?)/
In the event that your lines always have / at the end of the line:
([^/]*)/$
Alternate split method:
split("/")[2]
The regex would be:
/[^/]*/([^/]*)/
so in Perl, the regex capture statement would be something like:
($database) = $text =~ m!/[^/]*/([^/]*)/!;
Normally the / character is used to delimit regexes but since they're used as part of the match, another character can be used. Alternatively, the / character can be escaped:
($database) = $text =~ /\/[^\/]*\/([^\/]*)\//;
You can shorten the pattern even more this way:
[^/]+/(\w+)
Here \w includes characters like A-Z, a-z, 0-9 and _
I would suggest giving a split function priority, since in my experience it performs better than regex functions wherever it can be used.
You can use the explode function in PHP, or split in other languages, to do this.
Anyway, here is a regex pattern:
/[\/]*[^\/]+[\/]([^\/]+)/
I know you specifically asked for regex, but you don't really need regex for this. You simply need to split the string by delimiters (in this case a forward slash), then choose the part you need (in this case, the 3rd field - the first field is empty).
cut example:
cut -d '/' -f 3 <<< "$string"
awk example:
awk -F '/' '{print $3}' <<< "$string"
perl expression, using split function:
(split '/', $string)[2]
etc.
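As a quick comparison in Perl, the regex capture and the split approach can be put side by side on the example string from the question. The variable names are chosen for this sketch only.

```perl
use strict;
use warnings;

my $text = '/ipaddress/databasename/';

# Regex capture: skip the first field, capture the second.
my ($by_regex) = $text =~ m!/[^/]*/([^/]*)/!;

# split: the string starts with "/", so field 0 is empty and the
# database name is field 2 (the 3rd field).
my $by_split = (split m!/!, $text)[2];

print "regex: $by_regex\n";
print "split: $by_split\n";
```

Both lines print databasename for this input.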

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing and learning Perl so that I can read log files. I want to search every line and look for a string of alphanumeric followed by this ; at the beginning of each line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use the ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
\A # absolute start of string
(?:$alphanum)+ # I can change this elsewhere
;
/x;
if( $line =~ $regex ) { ... }
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.
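The difference between ^ and \A only shows up when the /m flag is in effect and the string contains an embedded newline. A small sketch, with a hypothetical helper leading_token and made-up log lines:

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading alphanumeric token that is
# followed by ";", or undef if the string does not start with one.
sub leading_token {
    my ($line) = @_;
    return $line =~ /\A([a-z0-9]+);/i ? $1 : undef;
}

for my $line ('ERR42;disk full', '  ERR42;disk full') {
    my $token = leading_token($line);
    print defined $token ? "token: $token\n" : "no token\n";
}

# With /m, ^ also matches right after an embedded newline; \A never does.
my $two_lines = "no token here\nERR42;disk full";
print "^ with /m matches\n"  if $two_lines =~ /^[a-z0-9]+;/mi;
print "\\A does not match\n" if !defined leading_token($two_lines);
```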

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like to have a regex that matches all <![FruitName]!>, between these <![FruitName]!>, there may be some garbage text, my first attempt is like this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
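In Perl, a global match in list context returns all captures, which makes it easy to try this out. The sketch below is a variation on the regex above, not a verbatim copy: the lookahead alternative is left uncaptured so that each match yields exactly two values, and \z (end of string) is used in place of $. The sample input, including the made-up name Ki]wi, is an assumption for this illustration.

```perl
use strict;
use warnings;

my $input = '<![Apple]!>some text<![Ki]wi]!><![Pear]!>tail';

# Each match yields two captures: the name and the text up to the next tag.
# The lazy (.*?) stops at the first "]!>", so a lone "]" inside a name is kept.
my @pairs = $input =~ /<!\[(.*?)]!>(.*?)(?=<!\[|\z)/gs;

for (my $i = 0; $i < @pairs; $i += 2) {
    print "name='$pairs[$i]' trailing='$pairs[$i+1]'\n";
}
```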
Depending on what language you are using, you can use the string methods your language provides and do simple splitting (with a simple regex that is easier to understand). Split your string using "!>" as the separator. Go through each field and check for <!. If found, remove all characters from the front up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
# for each field
for(i=1;i<=NF;i++){
# check if there is <!
if($i ~ /<!/){
# if <! is found, substitute from front till <!
gsub(/.*<!/,"",$i)
}
# print result
print $i
}
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.