How to print only matches with sed? - regex

Okay, this is an easy one, but I can't figure it out.
Basically I want to extract all links ([^<>]*) from a big html file.
I tried to do this with sed, but I get all kinds of results, just not what I want. I know that my regexp is correct, because I can replace all the links in a file:
sed 's_[^<>]*_TEST_g'
If I run that on something like
<div>A google link</div>
<div>A google link</div>
I get
<div>TEST</div>
<div>TEST</div>
How can I get rid of everything else and just print the matches instead? My preferred end result would be:
A google link
A google link
PS. I know that my regexp is not the most flexible one, but it's enough for my intentions.

Match the whole line, put the interesting part in a group, replace by the content of the group. Use the -n option to suppress non-matching lines, and add the p modifier to print the result of the s command.
sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
Note that if there are multiple links on the line, this only prints the last link. You can improve on that, but it goes beyond simple sed usage. The simplest method is to use two steps: first insert a newline before any two links, then extract the links.
sed -n -e 's!</a>!&\n!p' | sed -n -e 's!^.*\(<[Aa] [^<>]*>.*</[Aa]>\).*$!\1!p'
This still doesn't handle HTML comments, <pre>, links that are spread over several lines, etc. When parsing HTML, use an HTML parser.

If you don't mind using perl like sed it can copy with very diverse input:
perl -n -e 's+(<a href=.*?</a>)+ print $1, "\n" +eg;'

Assuming that there is only one hyperlink per line the following may work...
sed -e 's_.*&lta href=_&lta href=_' -e 's_>.*_>ed &lt&lt'EOF'
-e 's_.*&lta href=_&lta href=_' -e 's_>.*_>_'

This might work for you (GNU sed):
sed '/<a href\>/!d;s//\n&/;s/[^\n]*\n//;:a;$!{/>/!{N;ba}};y/\n/ /;s//&\n/;P;D' file

Related

Sed dynamic backreference replacement

I am trying to use sed for transforming wikitext into latex code. I am almost done, but I would like to automate the generation of the labels of the figures like this:
[[Image(mypicture.png)]]
... into:
\includegraphics{mypicture.png}\label{img-1}
For what I would like to keep using sed. The current regex and bash code I am using is the following:
__tex_includegraphics="\\\\includegraphics[width=0.95\\\\textwidth]{$__images_dir\/"
__tex_figure_pre="\\\\begin{figure}[H]\\\\centering$__tex_includegraphics"
__tex_figure_post="}\\\\label{img-$__images_counter}\\\\end{figure}"
sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"\
... but I cannot make that counter to be increased. Any ideas?
Within a more general perspective, my question would be the following: can I use a backreference in sed for creating a replacement that is different for each of the matches of sed? This is, each time sed matches the pattern, can I use \1 as the input of a function and use the result of this function as the replacement?
I know it is a tricky question and I might have to use AWK for this. However, if somebody has a solution, I would appreciate his or her help.
This might work for you (GNU sed):
sed -r ':a;/'"$PATTERN"'/{x;/./s/.*/echo $((&+1))/e;/./!s/^/1/;x;G;s/'"$PATTERN"'(.*)\n(.*)/'"$PRE"'\2'"$POST"'\1/;ba}' file
This looks for a PATTERN contained in a shell variable and if not presents prints the current line. If the pattern is present it increments or primes the counter in the hold space and then appends said counter to the current line. The pattern is then replaced using the shell variables PRE and POST and counter. Lastly the current line is checked for further cases of the pattern and the procedure repeated if necessary.
You could read the file line-by-line using shell features, and use a separate sed command for each line. Something like
exec 0<input_file
while read line; do
echo $line | sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"
__images_counter=$(expr $__images_counter + 1)
done
(This won't work if there are multiple matches in a line, though.)
For the second part, my best idea is to run sed or grep to find what is being matched, and then run sed again with the value of the function of the matched text substituted into the command.

Print RegEx matches using SED in bash

I have an XML file, the file is made up of one line.
What I am trying to do is extract the "finalNumber" attribute value from the file via Putty. Rather than having to download a copy and search using notepad++.
I've built up a regular expression that I've tested on an On-line Tool, and tried using it within a sed command to duplicate grep functionality. The command runs but doesn't return anything.
RegEx:
(?<=finalNumber=")(.*?)(?=")
sed Command (returns nothing, expected 28, see file extract):
sed -n '/(?<=finalNumber=")(.*?)(?=")/p' file.xml
File Extract:
...argo:finalizedDate="2012-02-09T00:00:00.000Z" argo:finalNumber="28" argo:revenueMonth=""...
I feel like I am close (i could be wrong), am I on the right lines or is there better way to achieve the output?
Nothing wrong with good old grep here.
grep -E -o 'finalNumber="[0-9]+"' file.xml | grep -E -o '[0-9]+'
Use -E for extended regular expressions, and -o to print only the matching part.
Though you already select an answer, here is a way you can do in pure sed:
sed -n 's/^.*finalNumber="\([[:digit:]]\+\)".*$/\1/p' <test
Output:
28
This replaces the entire line by the match number and print (because p will print the entire line so you have to replace the entire line)
This might work for you (GNU sed):
sed -r 's/.*finalNumber="([^"]*)".*/\1/' file
sed does not support look-ahead assertions. Perl does, though:
perl -ne 'print $1 if /(?<=finalNumber=")(.*?)(?=")/'
As I understand, there is no need to use look-aheads here.
Try this one
sed -n '/finalNumber="[[:digit:]]\+"/p'

sed conditional replace of a variable

I have within a file a bunch of codenumbers that in general are of the form, integer.integer The first integer is necessary, the second may be empty. e.g. 123.45 or 12.345 and 12 are all valid codenumbers.
I want to use sed to change each of these lines into
job{123}subjob{45}
job{12}subjob{345}
job{12}
So far I have
sed -e 's/codenumber{\([0-9]*\)\.*\([0-9]*\)}/job{\1}subjob{\2}/g'
which results in
job{123}subjob{45}
job{12}subjob{345}
job{12}subjob{}
Is there a way for sed to realise that when the variable \2 is empty, to print a default value instead, say 0. Hence the last line of the given example would say
job{12}subjob{0}
I suppose this could be possible via two sed runs, but I am interested if it was possible with one.
You could simply extend your sed command to patch up empty subjob numbers:
sed -e 's/codenumber{\([0-9]*\)\.*\([0-9]*\)}/job{\1}subjob{\2}/g' \
-e 's/subjob{}/subjob{0}/g'
I don't think this is possible in sed. But indeed you can do two sed runs (they're really fast so it shouldn't be a problem), the second being
sed -e 's/subjob\{\}//g'
This might work for you (GNU sed):
sed 's/codenumber/job/;s/\./}subjob{/;/subjob/!s/$/subjob{0}/' file
I know, a bit too late and in addition not answering the question, but if someone lands here as I did, and does not mind to use Perl instead, so the expression is:
s/codenumber{([0-9]*)\.*([0-9]*)}/job{$1}subjob{${\($2?$2:"0")}}/g
e.g. in:
echo -e 'codenumber{123.45}\ncodenumber{12.345}\ncodenumber{12}' | perl -e 'while(<STDIN>) { print s/codenumber{([0-9]*)\.*([0-9]*)}/job{$1}subjob{${\($2?$2:"0")}}/gr;}'
So, the point is that Perl allows you to use the string interpolation ${\(EXPRESSION)} also in the regexp and calculate the replacement, based on the matched value, using a Perl expression.

Why does sed /^$/d delete only blank lines but /^$/p print all lines?

I'm able to use sed /^$/d <file> to delete all the blank lines in the file, but what if I want to print all the blank lines only? The command sed /^$/p <file> prints all the lines in file.
The reason I want to do this is that we use an EDA program (Expedition) that uses regex to run rules on the names of nets. I'm trying to find a way to search for all nets that don't have names assigned. I thought using ^$ would work, but it just ends up finding all nets, which is what /^$/p is doing too. So is there a different way to do this?
Unless otherwise specified sed will print the pattern space when it has finished processing it. If you look carefully at your output you'll notice that you get 2 blank lines for every one in the file. You'll have to use the -n command line switch to stop sed from printing.
sed -n /^$/p infile
Should work as you want.
You can also use grep as:
grep '^$' infile
Sed prints every line by default, and so the p flag is useless. To make it useful, you need to give sed the -n switch. Indeed, the following appears to do what you want:
sed -n /^$/p
think in another way, don't p, but !d
you may try:
sed '/^$/!d' yourFile

Sed substitution not doing what I want and think it should do

I have am trying to use sed to get some info that is encoded within the path of a file which is passed as a parameter to my script (Bourne sh, if it matters).
From this example path, I'd like the result to be 8
PATH=/foo/bar/baz/1-1.8/sing/song
I first got the regex close by using sed as grep:
echo $PATH | sed -n -e "/^.*\/1-1\.\([0-9][0-9]*\).*/p"
This properly recognized the string, so I edited it to make a substitution out of it:
echo $PATH | sed -n -e "s/^.*\/1-1\.\([0-9][0-9]*\).*/\1/"
But this doesn't produce any output. I know I'm just not seeing something simple, but would really appreciate any ideas about what I'm doing wrong or about other ways to debug sed regular expressions.
(edit)
In the example path the components other than the numerical one can contain numbers similar to the numeric path component that I listed, but not quite the same. I'm trying to exactly match the component that that is 1-1. and see what some-number is.
It is also possible to have an input string that the regular expression should not match and should product no output.
The -n option to sed supresses normal output, and since your second line doesn't have a p command, nothing is output. Get rid of the -n or stick a p back on the end
It looks like you're trying to get the 8 from the 1-1.8 (where 8 is any sequence of numerics), yes? If so, I would just use:
echo /foo/bar/baz/1-1.8/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
No doubt you could get it working with one sed "instruction" (-e) but sometimes it's easier just to break it down.
The first strips out everything from the start up to and including 1-1., the second strips from the first non-numeric after that to the end.
$ echo /foo/bar/baz/1-1.8/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
8
$ echo /foo/bar/baz/1-1.752/sing/song | sed -e "s/.*\/1-1\.//" -e "s/[^0-9].*//"
752
And, as an aside, this is actually how I debug sed regular expressions. I put simple ones in independent instructions (or independent part of a pipeline for other filtering commands) so I can see what each does.
Following your edit, this also works:
$ echo /foo/bar/baz/1-1.962/sing/song | sed -e "s/.*\/1-1\.\([0-9][0-9]*\).*/\1/"
962
As to your comment:
In the example path the components other than the numerical one can contain numbers similar to the numeric path component that I listed, but not quite the same. I'm trying to exactly match the component that that is 1-1. and see what some-number is.
The two-part sed command I gave you should work with numerics anywhere in the string (as long as there's no 1-1. after the one you're interested in). That's because it actually deletes up to the specific 1-1. string and thereafter from the first non-numeric). If you have some examples that don't work as expected, toss them into the question as an update and I'll adjust the answer.
You can shorten you command by using + (one or more) instead of * (zero or more):
sed -n -e "s/^.*\/1-1\.\([0-9]\+\).*/\1/"
don't use PATH as your variable. It clashes with PATH environment variable
echo $path|sed -e's/.*1-1\.//;s/\/.*//'
You needn't divide your patterns with / (s/a/b/g), but may choose every character, so if you're dealing with paths, # is more useful than /:
echo /foo/1-1.962/sing | sed -e "s#.*/1-1\.\([0-9]\+\).*#\1#"