Perl regex - anchors and pattern matching

I wrote a Perl regex to extract the words after a certain anchor, but it does not seem to work. What am I doing wrong?
This is my actual output; I need to extract every number after the groups keyword:
$id cuser301 uid=2301(cuser301) gid=32(rpc) groups=32(rpc),1001(cgrp1),1002(cgrp2),1003(cgrp3),1004(cgrp4),1005(cgrp5),1006(cgrp6),1007(cgrp7),1008(cgrp8),1009(cgrp9),1010(cgrp10),1011(cgrp11),1012(cgrp12),1013(cgrp13),1014(cgrp14),1015(cgrp15),1016(cgrp16),1017(cgrp17),1018(cgrp18),1019(cgrp19),1020(cgrp20),1021(cgrp21),1022(cgrp22),1023(cgrp23),1024(cgrp24),1025(cgrp25),1026(cgrp26),1027(cgrp27),1028(cgrp28),1029(cgrp29),1030(cgrp30),1031(cgrp31),1032(cgrp32)
As shown above, I run the id command and would like to capture the numbers after groups=. Please help.
I am using the following:
my $check_groups = execute("\id $user"); #---> (execute is to run commands on the linux client, please ignore it)
my $new_groups = ('/^groups/',$check_groups); # ---> Now $new_groups should have all numbers after groups.

my $input = '$id cuser301 uid=2301(cuser301) gid=32(rpc) groups=32(rpc),1001(cgrp1),1002(cgrp2),1003(cgrp3),1004(cgrp4),1005(cgrp5),1006(cgrp6),1007(cgrp7),1008(cgrp8),1009(cgrp9),1010(cgrp10),1011(cgrp11),1012(cgrp12),1013(cgrp13),1014(cgrp14),1015(cgrp15),1016(cgrp16),1017(cgrp17),1018(cgrp18),1019(cgrp19),1020(cgrp20),1021(cgrp21),1022(cgrp22),1023(cgrp23),1024(cgrp24),1025(cgrp25),1026(cgrp26),1027(cgrp27),1028(cgrp28),1029(cgrp29),1030(cgrp30),1031(cgrp31),1032(cgrp32)';
print join ',', $input =~ /(?:.*groups=|\G.*?)\b([0-9]+)/g;
This is a common pattern; in more complicated cases where you want to ensure the \G branch only applies after the first non-zero-length match, you can use \G(?!\A) instead of just \G.

Try doing this (note that it only picks up numbers preceded by a comma, so the first group, 32, is skipped):
$ echo <INPUT> | perl -ne 'print "$1," while /,(\d+)\(/g'
Check https://regex101.com/r/uZ9tO6/1

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, the -p flag makes Perl read the input one line at a time, so your regex never sees anything after the \n.
Instead, you want to read the whole input at once. You can do this with the -0777 flag (documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
For reference, here is your initial attempt:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one character and never whitespace. As a start, add a repetition quantifier and allow spaces in the character class. Also, because the match consumes the 'to', you must put it back in the replacement string, as in the example program below:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/$1 to/;
print "after:\n$str\n";
This program produces the output below:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I understood what you want to do, your regex should look more like:
s/([ a-z]+)\n\s*to/$1 to/ (please note the leading space before 'a-z').

Perl Regex Command Line Issue

I'm trying to use a negative lookahead in Perl on the command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but it just returns:
![0-9]+\.[0-9]+\.: event not found
I've tried it in regex101 (PCRE) and it works fine, so I'm not sure why it doesn't work here.
In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilà:
1.41.2
I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2
If you want to "increment" a number, then you can't hard-code the new value; you need to capture what is there and increment that:
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here the /e modifier makes the replacement side be evaluated as code, so we can add 1 to the captured number, and the result is what gets substituted. The \K discards everything matched before it from the match, so we don't need to put it back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes used around the Perl program "invite" the shell to look at what's inside whereby it interprets the ! as history expansion and runs that, as explained in ruakh's post.
As an alternative to lookahead, we can use capture groups, e.g. the following will capture the version number into 3 capture groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question (in Perl, the replacement side should use $1 and $2 rather than \1 and \2), it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/$1.$2.2/'
1.14.2
$

Conditional in perl regex replacement

I'm trying to return different replacement results with a perl regex one-liner if it matches a group. So far I've got this:
echo abcd | perl -pe "s/(ab)(cd)?/defined($2)?\1\2:''/e"
But I get
Backslash found where operator expected at -e line 1, near "1\"
(Missing operator before \?)
syntax error at -e line 1, near "1\"
Execution of -e aborted due to compilation errors.
If the input is abcd I want to get abcd out, if it's ab I want to get an empty string. Where am I going wrong here?
You used regex atoms \1 and \2 (match what the first or second capture captured) outside of a regex pattern. You meant to use $1 and $2 (as you did in another spot).
Furthermore, dollar signs inside double-quoted strings have meaning to your shell. It's best to use single quotes around your program[1].
echo abcd | perl -pe's/(ab)(cd)?/defined($2)?$1.$2:""/e'
Simpler:
echo abcd | perl -pe's/(ab(cd)?)/defined($2)?$1:""/e'
Simpler:
echo abcd | perl -pe's/ab(?!cd)//'
[1] Either avoid single-quotes in your program[2], or use '\'' to "escape" them.
[2] You can usually use q{} instead of single-quotes. You can also switch to using double-quotes. Inside of double-quotes, you can use \x27 for an apostrophe.
Why torture yourself, just use a branch reset.
Find (?|(abcd)|ab())
Replace $1
And a couple of even better ways
Find abcd(*SKIP)(*FAIL)|ab
Replace ""
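Find (?:abcd)*+\Kab
Replace ""
(The possessive *+ matters here: with a plain *, the engine backtracks to zero repetitions and strips the ab out of abcd.)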
These use regex wisely.
There is really no need nowadays to have to use the eval form of the regex substitution construct s///e in conjunction with defined().
This is especially true when using the perl command line.
Good luck...

what regex to extract all data except within <> in perl?

I have the string
Message <Network=Data Center> All Verified
I need to extract the whole string except the part in angle brackets.
I tried:
m/(?![^<]*\\>)/s
It does not give the desired result.
Removing <..> regions
It's easier to remove the <..> parts from the string and then deal with the remaining string.
Try this oneliner:
perl -pe 's/<[^>]*?>//g;' file
For your sample input, this is the output:
Message All Verified
Notice the non-greedy quantifier ? is used in the regex (with [^>] it is optional, since the class already stops at the first >). Also, because this is a one-liner, the s/// search-and-replace construct is applied to the implicit variable $_ (which holds a line from the input). So after the search and replace has run, $_ will be altered (there will be no <..> regions in it). The -p switch is used to print $_ after running the block of code. You can read more about Perl command-line switches in perlrun.
This is one solution. Below there is another one:
Capturing regions outside of <..>
On the other hand, you can (if you want) match the parts outside of the <..> regions.
In order to do that let's build a regex. First, we want a < or > free region. The following regex matches just that
$p = ([^<>]*).
Next, we want to match everything before <, and for that we can write (?:$p<) and everything after >, and that's (?:>$p).
Now if we assemble all those parts together we get (?:>$p)|(?:$p<).
Notice that (?:) is a non-capturing group.
So now there are two capturing groups (the two $p you see above) but only one will match at a time, so some of the captures will be undef. We'll have to filter those out.
Finally, we can assemble all the captures, and we're done.
cat file | perl -ne '$p="([^<>]*)";@x=grep{defined} m{(?:>$p)|(?:$p<)}g; print join(" ",@x)."\n";'
Parse::Yapp parser
You might think that using Parse::Yapp for this particular problem is a bit too much (usually, if you have something complicated to parse, you would use a grammar and a parser generator), but .. why not.. :)
Ok, so we need a grammar. Here's one, grammar_file.yp:
#header
%%
#rules
expression:
| exterior '<' interior '>' exterior
| exterior
;
exterior:
| TOK { $_[0]->YYData->{DATA} .= $_[1]; }
| expression
;
interior: TOK;
%%
#footer
sub Error { my ($parser)=shift; }
sub Lexer {
    use Data::Dumper;
    my($parser)=shift;
    $parser->YYData->{INPUT} or return('',undef);
    #$parser->YYData->{INPUT}=~s/^\s+//;
    for ($parser->YYData->{INPUT}) {
        return ('TOK',$1) if(s/^([^<>]+)//);
        return ( $1,$1) if(s/^([<>])//);
    };
}
You will notice in the grammar above that the interior is completely ignored, and only the terminals from exterior are collected.
Here's a small program, parse.pl, that uses the parser (MyParser.pm, generated from grammar_file.yp):
#!/usr/bin/env perl
use strict;
use warnings;
use MyParser;
my $parser=MyParser->new;
$parser->YYData->{INPUT} = "Message <Network=Data Center> All Verified";
my $value=$parser->YYParse(
    yylex => \&MyParser::Lexer,
    yyerror => \&MyParser::Error,
    #yydebug => 0x1F,
);
my $nberr=$parser->YYNberr();
my $data=$parser->YYData->{DATA};
print "Result=$data\n";
And now a Makefile and we're done:
generate_parser_module:
	yapp -m MyParser grammar_file.yp;
run:
	perl parse.pl
all: generate_parser_module
Note
Some more Parser generators can be found here
Regexp::Grammars
Parse::RecDescent
Marpa::XS or Marpa::R2
You can do it another way: just remove the part in the angle brackets:
s#<.*>##
Or if > is not allowed:
s#<[^>]*>##
You can use sed for that:
cat yourfile |sed 's/<.*>//g' > newfile
If you need perl:
perl -i -pe "s/<.*?>//g" yourfile
Here is a compact approach. The following regex will capture your strings into Group 1:
<[^>]+>|([^<>]*)
What we are interested in here is not the overall match, but just the Group 1 matches.
So we need to iterate over Group 1 matches. I don't code in Perl, but following a recipe from the perlretut tutorial, this should do it:
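while ($x =~ /<[^>]+>|([^<>]*)/g) {
    # Group 1 is undefined when the bracket branch matched, and empty on zero-length matches.
    print "$1\n" if defined $1 && length $1;
}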
Please give it a try and let me know if it works for you.

How to retain the first instance of a match with sed

I have a set of tokens in data and wish to strip off the trailing ".[0-9]"; however, I cannot figure out how to quote the regexp properly. The first match should be everything up to the ., and the second the . and a number. I intend for the first match to be retained.
data="thing thing__aaa.0 thing__bbb.3 thing__ccc.5 other_aaa other_bbb other_ccc.5"
data=`echo $data | sed s/\([a-zA-Z0-9_]+\)\(\.[0-9]\)/\1/g`
echo $data
Actual output:
thing thing__aaa.0 thing__bbb.3 thing__ccc.5 other_aaa other_bbb other_ccc.5
Desired output:
thing thing__aaa thing__bbb thing__ccc other_aaa other_bbb other_ccc
The idea is that the unquoted ([a-zA-Z0-9_]+) is the first matching group, and the (\.[0-9]) matches the .number. the \1 should replace both groups with the first group.
How about just
echo $data | sed 's/\.[0-9]//g'
or if number may contain more digits, then
echo $data | sed 's/\.[0-9]\+//g'
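The capture-group idea from the question also works once the sed script is quoted, so the shell no longer strips the backslashes before sed sees them. The sketch below uses -E (extended regular expressions), which GNU and BSD sed both accept but which older seds may not, so treat that flag as an assumption about your environment.

```shell
data="thing thing__aaa.0 thing__bbb.3 thing__ccc.5 other_aaa other_bbb other_ccc.5"

# Single quotes protect the backslashes from the shell; with -E, ( ) and +
# are regex operators without extra escaping. \1 keeps only the token name.
stripped=$(printf '%s\n' "$data" | sed -E 's/([a-zA-Z0-9_]+)\.[0-9]+/\1/g')

echo "$stripped"
```

Using $( ) instead of backticks also avoids one more layer of backslash processing.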
It looks like you just want to delete all strings of the form \.[0-9]. So why not just do:
sed 's/\.[0-9]\+\b//g'
(This relies on GNU sed's \b and \+ extensions. For other sed you can do:
sed 's/\.[0-9][0-9]*\( \|$\)/\1/g'
I normally don't encourage the use of shell specific extensions, but if you are using bash you might be happy using an array:
bash$ data=(thing thing__aaa.0 thing__bbb.3)
bash$ echo "${data[@]%.[0-9]*}"
Note that this will also delete extensions that are not all digits (ie foo.34bb), but perhaps is adequate for your needs.)