Regex does not match in Perl, while it does in other programs - regex

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?

By default, Perl's -p flag reads the input one line at a time, so you can't expect your regex to match anything after \n.
Instead, you want to read the whole input at once. You can do this by using the flag -0777 (this is documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file

For reference, here is your initial Perl attempt:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one character and never whitespace. To start with, add a repetition quantifier and let the class also consume spaces. Also, because the matched 'to' is consumed by the substitution, you must put it back in the replacement string, as in the example Perl program below:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/$1 to/;
print "after:\n$str\n";
This program produces the following output:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I have understood what you want to do, your regex should look more like:
s/([ a-z]+)\n\s*to/$1 to/ (please note the leading space before 'a-z').

Related

Perl Regex Command Line Issue

I'm trying to use a negative lookahead in perl in command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but it just returns:
![0-9]+\.[0-9]+\.: event not found
I've tried it in regex101 (PCRE) and it works fine, so I'm not sure why it doesn't work here.
In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilà:
1.41.2
I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2
Please see the demo here.
If you want to "increment" a number, then you can't hard-code the new value; you need to capture what is there and increment it:
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here the /e modifier makes the replacement side be evaluated as code, so we can add 1 to the captured number, and the result is then substituted. The \K drops everything matched before it from the match, so we don't need to put it back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes around the Perl program "invite" the shell to look at what's inside; it interprets the ! as history expansion and runs that, as explained in ruakh's post.
As an alternate to lookahead, we can use capture groups, e.g. the following will capture the version number into 3 capture groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question, it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/\1.\2.2/'
1.14.2
$

s/// returns out of place newline

I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently unnecessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)
[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .
Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
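After investigating all the answers and trying to figure them out, I've decided that the root of the problem is that [^/]+ is greedy. In Perl, $ can match either at the very end of the string or just before a trailing newline; because [^/]+ is greedy and a newline is not a slash, it swallows the newline and $ then matches at the very end.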
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by @Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically changes the $ anchor so as to behave the way a sed person would expect it to.
The answer by @CrafterKolyan takes care of the greedy [^/] capturing the newline by saying you can't have a forward-slash or a newline. This answer still needs the $ anchor to prevent the following situation:
1) .* captures the empty string (0 or more of any character)
2) [^/\n]+ captures . .
The answer by @Borodin takes a quite different approach, but it's a great concept.
@Borodin, in addition, made a great comment that allows a more precise version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like perl (it has no non-greedy wildcards), here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.
Use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5
Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them.
Like this:
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101

Regex to extract everything until it encounters a number after a slash

I am looking to extract everything from a string but ignore everything after encountering a purely numeric segment after a slash (alphanumeric segments are allowed).
Examples:
http://www.test.com/products/cards/product_code100/12345/something_else
http://www.test.com/products/123abc/45678/
Desired output -
http://www.test.com/products/cards/product_code100/
http://www.test.com/products/123abc/
The following regex gives me everything in backreferences, but it would be great if I could get rid of the numbers after a slash:
^(.*:)//([a-z\-.]+)(:[0-9]+)?(.*)
Additional information: a language-independent regex is needed.
Many Thanks
This should work with most languages and should produce the desired output:
(http://.*)(?=/\d+(?!\w+))
It takes every character until it finds (via lookahead) a / followed by a number.
If you'd try to match
http://www.test.com/products/123abc/
or
http://www.test.com/products/123abc
it simply would not find a match, so you could be sure the checked string does not contain a pure number after a slash.
Example in Perl:
echo "http://...." | perl -pe 's/(.*\/)\d+\/.*/$1/'
or:
echo "http://...." | perl -ne 'print "$1\n" if /(.*\/)\d+\/.*/'
Edit: It's true what @creinig noted in his comment - there is no such thing as generic regex. Nonetheless, Perl is widely used, so it's an option.

regexp greediness: shrinking long path

Please have a look at my brain-teaser.
I'm stuck trying to shorten a long path with a regex, like this one:
/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890
I'd like to transform this path to the following form:
/123/123/123/123/12/1/123/132/123/123
Each "directory" in the path is abbreviated to its first 3 characters:
LONG_PATH="/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890"
perl -pe "s#/(.{1,3})[^/]*?(/|$)#/\1\2#g" <<<$LONG_PATH
/123/123456/123/123/12//1234567/132/123456789/123
sed -E "s#/(.{1,3})[^/]*?(/|$)#/\1\2#g" <<<$LONG_PATH
/123/123456/123/123/12//1234567/132/123456789/123
I have tried also:
perl -pe "s,/(.)(.)?(.)?[^/]*+,/\1\2\3,g" <<<$LONG_PATH
/123/123/123/123/12//123/132/123/123
and many others, with no "luck" - I'm still out of ideas.
Please point me in the right direction.
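Match a slash followed by exactly three non-slash characters and capture them. Then match the rest up to the next slash. Replace with the capture (components shorter than three characters do not match and are left untouched):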
"s#(/[^/]{3})[^/]*#\1#g"
There is no need for ungreediness or anything here, because the negated character class is mutually exclusive with the / or $.
EDIT: Although you seem to know this I should probably clarify for future visitors that this will work with either perl -pe... or sed -E... as you have used it in your question. The regex could also be used as is with sed -r.... If you leave out the -E or -r option, then (as usual) you will need to escape both the parentheses and curly brackets:
sed "s#\(/[^/]\{3\}\)[^/]*#\1#g" filename
Note also as ikegami points out that in Perl you should rather use $1 in the replacement than \1.
You could do it like this:
perl -pe's#[^/]{3}\K[^/]*##g'
/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890
/123/123/123/123/12/1/123/132/123/123
Find 3 non-slashes, and keep (\K) them, remove the following characters up until the next slash.
As ikegami pointed out, there is no need to match components shorter than three characters, so a lookbehind assertion can be used instead of \K. The benefit is that the lookbehind works on older perls: \K requires perl v5.10, and I believe look-around assertions predate that.
perl -pe 's#(?<=[^/]{3})[^/]*##g'
The best way seems to be to use the File::Spec module to split and recombine the path. An intermediate call to map will reduce each path segment to its first three characters. This program demonstrates:
use strict;
use warnings;
use File::Spec;
my $path = '/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890';
my $newpath = File::Spec->catdir(map substr($_, 0, 3), File::Spec->splitdir($path));
print $newpath;
output
/123/123/123/123/12/1/123/132/123/123

Is there a truly universal wildcard in Grep? [duplicate]

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
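The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible. Note that -000 reads the input in paragraph mode, so the header and footer must fall within the same blank-line-delimited paragraph; to match across blank lines, slurp the whole file with -0777 instead.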
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit', and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s):
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed out elsewhere, grep will work for single-line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file