s/// returns out of place newline - regex

I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently unnecessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)

[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .
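For anyone who wants to poke at this outside Perl, here is a minimal sketch in Python, whose re module treats [^/]+ and $ the same way on this input. The function names are mine, and [0-9a-f] stands in for [[:alnum:]] (which Python's re does not support), so treat it as an illustration rather than a translation of the original command.

```python
import re

# Sample line from the question's input; note the trailing newline.
LINE = "e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg\n"
PATTERN = re.compile(r"^([0-9a-f]+).*?([^/]+)$")

def swap_raw(line):
    # Like `perl -pe`: the newline is still attached, so [^/]+ swallows it.
    return PATTERN.sub(r"\2 \1", line)

def swap_chomped(line):
    # Like `perl -lpe`: strip the newline first, add it back after the swap.
    return PATTERN.sub(r"\2 \1", line.rstrip("\n")) + "\n"

print(repr(swap_raw(LINE)))      # newline ends up between filename and hash
print(repr(swap_chomped(LINE)))  # newline ends up after the hash
```

The only difference between the two functions is where the newline lives while the substitution runs, which is exactly what -l changes.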

Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
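After investigating all the answers and trying to figure them out, I've decided that the root of the problem is that [^/]+ is greedy. Its greediness makes it capture the newline: in Perl, $ can match either at the very end of the string or just before a final newline, and the greedy quantifier reaches the very end first.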
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by @Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically strips the newline before the substitution runs (and adds it back when printing), so the $ anchor behaves the way a sed person would expect it to.
The answer by @CrafterKolyan takes care of the greedy [^/] capturing the newline by saying you can't have a forward-slash or a newline. This answer still needs the $ anchor to prevent the following situation:
1) .*? captures the empty string (0 or more of any character, as few as possible)
2) [^/\n]+ captures " ." (the space and the dot that precede the first slash)
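To see that situation concretely, here is a small Python sketch (Python's re behaves the same way for this pattern). The helper name swap and the [0-9a-f] class are my own stand-ins, not anything from the answers.

```python
import re

LINE = "e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg\n"

def swap(line, anchored):
    # [^/\n]+ cannot cross a newline; the only variable here is the trailing $.
    pattern = r"^([0-9a-f]+).*?([^/\n]+)" + ("$" if anchored else "")
    return re.sub(pattern, r"\2 \1", line)

print(repr(swap(LINE, anchored=True)))   # filename, then hash
print(repr(swap(LINE, anchored=False)))  # only " ." is moved in front of the hash
```

Without the anchor, the lazy .*? is satisfied with nothing at all, so the second group stops at the first slash.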
The answer by @Borodin takes a quite different approach, but it's a great concept.
@Borodin, in addition, made a great comment that allows a more-precise/more-exact version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like perl (no non-greedy wildcards), here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.

Use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5

Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them
Like this
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
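The same match-then-build idea, sketched in Python for comparison. This is my own rough equivalent, not the answerer's code: the possessive \w++ is written as plain \w+ (possessive quantifiers need Python 3.11 or later), and the name reorder is made up.

```python
import re

# \Z in Python's re matches only at the very end of the string (like Perl's \z).
RECORD = re.compile(r"(\w+).*?([^/\s]+)\s*\Z")

def reorder(line):
    match = RECORD.search(line)
    if match is None:
        # Mirrors the `die unless` in the Perl version.
        raise ValueError(f"unexpected line: {line!r}")
    digest, filename = match.groups()
    return f"{filename} {digest}"

print(reorder("afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg\n"))
```

Because [^/\s]+ excludes all whitespace, the trailing newline never ends up in the filename, and the output line is built explicitly.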

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl's -p flag reads the input one line at a time, so you cannot expect your regex to match anything after the \n.
Instead, you want to read the whole input at once. You can do this by using the flag -0777 (this is documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
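The difference between line-by-line and slurped processing can be sketched in Python as well. The sample text below is an assumption on my part: I added a leading space to the second line, because the \s in the pattern has to match something. The function names are hypothetical.

```python
import re

# Assumed sample: the second line starts with one space so that \s can match.
TEXT = "load Add 20 percent\n to accommodate\n"
JOIN = re.compile(r"([a-z])\n\s(to)")

def per_line(text):
    # Mimics `perl -p`: the pattern only ever sees one line at a time.
    return "".join(JOIN.sub(r"\1 \2", line) for line in text.splitlines(keepends=True))

def slurped(text):
    # Mimics `perl -0777 -p`: the pattern sees the whole input at once.
    return JOIN.sub(r"\1 \2", text)

print(repr(per_line(TEXT)))  # unchanged, nothing follows \n within a line
print(repr(slurped(TEXT)))   # the two lines are joined
```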
For reference, here is your initial Perl command again:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one character and never whitespace. As a start, add a repetition specifier and let the class 'eat' spaces as well. Also, to keep the matched (and therefore consumed) 'to' in the result, you must put it back in the replacement string, as in the example Perl program below:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/$1 to/;
print "after:\n$str\n";
This program produces the output below:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I understood correctly what you want to do, your regex should look more like:
s/([ a-z]+)\n\s*to/$1 to/ (please note the leading space before 'a-z').

sed : match all instances of regex in infile1.txt, and output only these to outfile2.txt

I have a text file infile1 with 1,000's of lines.
I wish to use sed to extract every occurrence of a regex pattern match to outfile2.
NB
Each instance of the regex pattern match may occur more than once on each line of infile1.
Each instance of the extracted regex pattern should be printed to a new line in outfile2.
Does anyone know the sed syntax in which to place the regex?
ps the regex pattern is
\(Google[ ]{1,3}“[a-zA-Z0-9 ]{1,100}[., ]{0,3}”\)
Thank you :)
I think you want
grep -oE 'Google[ ]{1,3}"[a-zA-Z0-9 ]{1,100}[., ]{0,3}"' filename
-o tells grep to print only the matches, each on a line of its own, and -E instructs it to interpret the regex in extended POSIX syntax, which your regex appears to be.
Note that [ ] could be replaced with just a space, and you might want to use [[:alnum:] ] instead of [a-zA-Z0-9 ] to cover umlauts and suchlike if they exist in the current locale.
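If it helps to see what -o does, here is a rough Python equivalent using re.findall, which returns every match on a line rather than the line itself. The sample lines and the name extract are made up for the illustration, and straight quotes are assumed as in the grep command above.

```python
import re

PATTERN = re.compile(r'Google[ ]{1,3}"[a-zA-Z0-9 ]{1,100}[., ]{0,3}"')

def extract(lines):
    # One list entry per match, in input order, like one output line per match from grep -o.
    return [m for line in lines for m in PATTERN.findall(line)]

# Hypothetical input: two matches on the first line, none on the second.
sample = ['x Google "alpha beta." y Google  "gamma" z', 'no match here']
for hit in extract(sample):
    print(hit)
```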
Addendum: It is also possible to do this with sed. I don't recommend it, but you could write (using GNU sed):
sed -rn 's/Google[ ]{1,3}"[A-Za-z0-9 ]{1,100}[., ]{0,3}"/\n&\n/g; s/[^\n]*\n([^\n]*\n)/\1/g; s/\n[^\n]*$//p' filename
To make this work with older versions of BSD sed, use -En instead of -rn. -r and -E enable extended regex syntax. -r was historically used by GNU sed, -E by BSD sed; newer versions of them support both for compatibility. -n disables auto-printing.
The code works as follows:
# mark all occurrences of the regex by circumscribing them with newlines
s/Google[ ]{1,3}"[A-Za-z0-9 ]{1,100}[., ]{0,3}"/\n&\n/g
# Isolate every other line from the pattern space (the matches). This will
# leave the part behind the last match...
s/[^\n]*\n([^\n]*\n)/\1/g
# ...so we remove it afterwards and print the result of the transformation if it
# happened (the s///p flag does that). The transformation will not happen if
# there were no matches in the line (because then no newlines will have been
# inserted), so in those cases nothing will be printed.
s/\n[^\n]*$//p
It can be done with sed too, but it isn't pretty:
sed -n ':start /foo/{ h; s/\(foo\).*/\1/; s/.*\(foo\)/\1/; p; g; s/foo\(.*\)/\1/; b start; }' infile1 >outfile2
-- provided that you replace the four occurrences of foo above with your pattern Google {1,3}“[a-zA-Z0-9 ]{1,100}[., ]{0,3}”.
Yeah, I told you it isn't pretty. :)

Grep/Sed between two tags with multiline

I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
As a result, I need to extract this text: from the first file - "This info i need grep" and from the second file - "This info= i need grep too"
In first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep" but I can not get the information from the second file by using the same command.
Please help me rewrite the command, or suggest a different one.
Or, if you insist on using grep, you can:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? to match any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
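For comparison, the same combination (dot matches newline, non-greedy quantifier, lookahead) looks like this in Python. Python's re has no \K, so this sketch uses a capture group instead; that substitution is mine, not part of the pcregrep command.

```python
import re

# re.DOTALL plays the role of (?s); the capture group replaces \K.
BETWEEN = re.compile(r"test (.*?)(?=</singleline>)", re.DOTALL)

# The two-line sample from the question.
text = "test This info=\ni need grep too</singleline>\n"
print(BETWEEN.findall(text))
```

As with the pcregrep version, the match still contains the embedded newline; only the delimiters are dropped.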
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since strings of interest are first converted back to their original single-line representations:
The matches are printed in their original form.
You can use regular (GNU) grep with line-by-line matching; contrast this with
needing to read the entire file at once, as in Maroun Maroun's helpful answer.
Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
needing to install another utility, pcregrep, as in Wintermute's helpful answer.
additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p in s/\.png//p forces the print if the substitution was done, but does not force it otherwise.
That is pretty easy to do with sed, and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a replacement command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. Also, it captures all non-period chars before .png", using the group brackets \( and \), so we can get what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character - represented by the backslash \ followed by an actual newline - and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
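The same whole-match-plus-group replacement can be sketched in Python, in case the sed quoting gets in the way of seeing what happens. Here \g<0> plays the role of sed's &, and the function name is made up.

```python
import re

def duplicate_without_png(line):
    # \g<0> is the whole match (sed's &); \1 is the captured base name.
    return re.sub(r'#"([^.]*)\.png",', r'\g<0>\n#"\1",', line)

for line in ['#"Afghanistan.png",', '#"American_Samoa.png",']:
    print(duplicate_without_png(line))
```

Lines that do not match the pattern come back unchanged, just as with the sed command.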
I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input

Is there a truly universal wildcard in Grep? [duplicate]

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
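For those more at home in Python, the /s and /g combination corresponds roughly to re.DOTALL plus re.findall. The page string below is invented sample data, and the same caveat about regexes and HTML applies.

```python
import re

# Invented sample: one multi-line block and one single-line block.
page = (
    "<html>\nHEADER TEXT\nfirst\nblock\nFOOTER TEXT\n"
    "noise\nHEADER TEXT second FOOTER TEXT\n"
)

# re.DOTALL is the counterpart of /s; findall returns every match, like /g.
chunks = re.findall(r"HEADER TEXT(.+?)FOOTER TEXT", page, re.DOTALL)
for chunk in chunks:
    print(repr(chunk))
```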
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
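The range logic can be written out as a small state machine. This Python sketch is my own approximation of what sed does: it uses plain substring checks instead of regexes, and the name select_range is hypothetical.

```python
def select_range(lines, start="HEADER TEXT", stop="FOOTER TEXT"):
    """Return lines from one containing `start` through the next one containing `stop`."""
    selected, inside = [], False
    for line in lines:
        if inside:
            selected.append(line)
            if stop in line:
                inside = False  # range closed; a later `start` line reopens it
        elif start in line:
            # Like sed, the end pattern is not tested on the line that opens the range.
            selected.append(line)
            inside = True
    return selected

print(select_range(["a", "HEADER TEXT", "b", "FOOTER TEXT", "c"]))
print(select_range(["HEADER TEXT x FOOTER TEXT", "y", "FOOTER TEXT", "z"]))
```

The second call shows the same-line problem: the footer on the opening line is ignored, so the range runs on to the next footer line.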
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s)
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed out elsewhere, grep will work for single-line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file