Perl Regex Command Line Issue - regex

I'm trying to use a negative lookahead in Perl on the command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but it just returns:
![0-9]+\.[0-9]+\.: event not found
I've tried it in regex101 (PCRE) and it works fine, so I'm not sure why it doesn't work here.

In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilà:
1.41.2

I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2

If you want to "increment" a number, then you can't hard-code the new value; you need to capture what is there and increment that:
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here the /e modifier makes the replacement side be evaluated as code, so we can add 1 to the captured number, and the result is then substituted. The \K drops everything matched before it from the match, so we don't need to put it back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes used around the Perl program "invite" the shell to look at what's inside whereby it interprets the ! as history expansion and runs that, as explained in ruakh's post.

As an alternative to a lookahead, we can use capture groups, e.g. the following will capture the version number into 3 capture groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question, it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/\1.\2.2/'
1.14.2
$

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl's -p flag reads the input one line at a time, so you can't expect your regex to match anything after \n.
Instead, you want to read the whole input at once. You can do this by using the flag -0777 (this is documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
As a reminder, this was your initial Perl command:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one character and never whitespace. So, as a start, add a repetition quantifier and allow the class to match spaces as well. Also, because the 'to' is consumed by the match, you must put it back in the replacement string, as in the Perl program below:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/\1 to/;
print "after:\n$str\n";
This program produces the output below:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I understood correctly what you want to do, your regex should look more like:
s/([ a-z]+)\n\s*to/\1 to/ (note the leading space before 'a-z').

s/// returns out of place newline

I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently non-necessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)
[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .
Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
After investigating all the answers and trying to figure them out, I've decided that the root of the problem is that [^/]+ is greedy. Its greediness causes it to capture the newline; the $ anchor does not prevent this, because $ matches either just before a trailing newline or at the very end of the string.
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by @Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically changes the $ anchor so as to behave the way a sed person would expect it to.
The answer by @CrafterKolyan takes care of the greedy [^/] capturing the newline by saying you can't have a forward slash or a newline. This answer still needs the $ anchor to prevent the following situation:
1) .*? captures the empty string (0 or more of any character, as few as possible)
2) [^/\n]+ captures " ." (the space and the dot before the first slash)
The answer by @Borodin takes a quite different approach, but it's a great concept.
@Borodin, in addition, made a great comment that allows a more precise version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like perl (no non-greedy wildcards,) here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.
Use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5
Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them
Like this
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101

Conditional in perl regex replacement

I'm trying to return different replacement results with a perl regex one-liner if it matches a group. So far I've got this:
echo abcd | perl -pe "s/(ab)(cd)?/defined($2)?\1\2:''/e"
But I get
Backslash found where operator expected at -e line 1, near "1\"
(Missing operator before \?)
syntax error at -e line 1, near "1\"
Execution of -e aborted due to compilation errors.
If the input is abcd I want to get abcd out, if it's ab I want to get an empty string. Where am I going wrong here?
You used regex atoms \1 and \2 (match what the first or second capture captured) outside of a regex pattern. You meant to use $1 and $2 (as you did in another spot).
Furthermore, dollar signs inside double-quoted strings have meaning to your shell. It's best to use single quotes around your program[1].
echo abcd | perl -pe's/(ab)(cd)?/defined($2)?$1.$2:""/e'
Simpler:
echo abcd | perl -pe's/(ab(cd)?)/defined($2)?$1:""/e'
Simpler:
echo abcd | perl -pe's/ab(?!cd)//'
[1] Either avoid single-quotes in your program[2], or use '\'' to "escape" them.
[2] You can usually use q{} instead of single-quotes. You can also switch to using double-quotes. Inside of double-quotes, you can use \x27 for an apostrophe.
Why torture yourself, just use a branch reset.
Find (?|(abcd)|ab())
Replace $1
And a couple of even better ways
Find abcd(*SKIP)(*FAIL)|ab
Replace ""
Find (?:abcd)*+\Kab
Replace ""
These use regex wisely.
There is really no need nowadays to have to use the eval form
of the regex substitution construct s///e in conjunction with defined().
This is especially true when using the perl command line.
Good luck...

using grep linux command with perl regex + capturing groups

so I've done some research on the subject and I didn't quite find the perfect solution.
For example I have a string inside a variable.
var="a1b1c2"
now what I want to do is match only "a" followed by any digit, but I only want it to return the number after "a"
To match it a rule such as
'a\d'
and since I only need the digit, I tried with
'a(\d)'
and maybe it did capture it somewhere, but I don't know where, the output here is still "a1"
I also tried a non-capturing group to ignore the "a" in the output, but no effect in perl regex:
'(?:a)\d'
for reference, this is the full command in my terminal:
[root@host ~]# var="a1b1c2"
[root@host ~]# echo $var |grep -oP "a(\d)"
a1 <--output
Probably it's also possible without the -P (some not-perl regex format), I'm thankful for every answer :)
EDIT:
using
\K
is not really the solution, since I don't necessarily need the last part of the match.
EDIT2:
I need to be able to get any part of the match, for instance:
[root@host ~]# var="a1b1c2"
[root@host ~]# echo $var |grep -oP "(a)\d"
a1 <--output
but the wanted output in this case would be "a"
EDIT3:
The problem is nearly solved using "look-behind assertions" such as:
(?<=a)\d
will not return the letter "a", only the digit following it, but it needs a fixed length, for example it cannot be used as:
(?<=\w+)\d
EDIT4:
The best way so far is either using perl or combine a combination of look-behind assertions and the \K but it still seems to have some limitations. For example:
1234_foo_1234_bar
1234567_foo_123456789_bar
1_foo_12345_bar
if "foo" and "bar" are place-holders for words that don't always have the same length,
there is no way to match all above examples while output "foobar", since the
number between them doesn't have a fixed length, while it can't be done with \K since we need "foo"
Any further suggestions are still appreciated :)
After some testing I found out that the pattern inside the look-behind assertion needs to be fixed length (something like (?<=\w+)something will not work). Any suggestions?
As I posted and deleted my answer previously because you stated it did not fit your needs:
Most of the time, you can avoid variable length lookbehinds by using \K. This resets the starting point of the reported match and any previously consumed characters are no longer included. (throws away everything that it has matched up to that point.)
The key difference between using \K and a lookbehind is that a lookbehind does not allow the use of quantifiers: the length of what you are looking for must be fixed. But \K can be placed anywhere in a pattern, so you are able to use any quantifiers.
As you can see in the example below, using a quantifier in the lookbehind will not work.
echo 'foosomething' | grep -Po '(?<=\w+)something'
#=> grep: lookbehind assertion is not fixed length
So you could do:
echo 'foosomething' | grep -Po '\w+\Ksomething'
#=> something
To get a substring only between two patterns, you can add Positive Lookahead into the mix.
echo 'foosomethingbar' | grep -Po 'foo\K.*?(?=bar)'
#=> something
Or use a fixed-length Lookbehind combined with a Lookahead.
echo 'foosomethingbar' | grep -Po '(?<=foo).*?(?=bar)'
#=> something
The pattern (?<=a)\d uses a look-behind assertion to only print a digit following the letter 'a'. This works with GNU grep -Po, ack -o, and pcregrep -o. The assertion is zero width, so it isn't included in the match.
You can use Perl directly, accessing environment variables through the %ENV hash (the shell variable must be exported for Perl to see it):
perl -lwe 'print $ENV{var} =~ /a(\d+)/;'
It will only print the capture, inside the parentheses.

Regex to extract everything until it encounters a number after a slash

I am looking to extract everything from a string but ignore everything after encountering numbers after a slash (alphanumeric allowed)
Examples:
http://www.test.com/products/cards/product_code100/12345/something_else
http://www.test.com/products/123abc/45678/
Desired output -
http://www.test.com/products/cards/product_code100/
http://www.test.com/products/123abc/
The following regex gives me everything in backreferences but it'll be great if I could get rid of numbers after a slash-
^(.*:)//([a-z\-.]+)(:[0-9]+)?(.*)
Additional Information - Language-independent regex needed.
Many Thanks
this should work with most languages and should produce the desired output
(http://.*)(?=/\d+(?!\w+))
It takes every character until it finds (via lookahead) a / followed by a number.
If you'd try to match
http://www.test.com/products/123abc/
or
http://www.test.com/products/123abc
it simply would not find a match, and you could be sure the string checked doesn't contain a pure number after a slash
Example in Perl:
echo "http://...." | perl -pe 's/(.*\/)\d+\/.*/$1/'
or:
echo "http://...." | perl -ne 'print "$1\n" if /(.*\/)\d+\/.*/'
Edit: It's true what @creinig noted in his comment - there is no such thing as generic regex. Nonetheless, Perl is widely used, so it's an option.