Grab value between two strings with sed?

I have the following data on one line:
Go to start of metadata
<div id="page-metadata-end" class="assistive"></div>
<fieldset class="hidden parameters">
<input type="hidden" title="browsePageTreeMode" value="view">
</fieldset>
<div class="wiki-content">
<p>(openissues)81(/openissues)</p><p>(assignstoday)0(/assignstoday)</p><p>(assignsweek)2(/assignsweek)</p><p>(replyissues)6(/replyissues)</p><p>(wrapissues)26(/wrapissues)</p>
</div>
I'd like to grab the value for "openissues", for example, but I can't figure out how to retrieve it properly. One of the things I tried is the following command:
sed -n '/(assignstoday)/,/(\/assignstoday)/p' ~/test.txt
Any help?

sed 's/.*(openissues)\(.*\)(\/openissues).*/\1/' test.txt
A quick hack that may meet your edited requirement:
sed -n '/openissues/p' test.txt | sed 's/.*(openissues)\(.*\)(\/openissues).*/\1/'
But regexes are really not the way to go when parsing HTML.
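The two-step pipeline above can probably be collapsed into a single sed call. The following is a minimal sketch, assuming the markers sit on one line as in the question; the sample line is shortened from the question's data and the variable names are illustrative only.

```shell
# Sample line shortened from the question (assumed to be a single line).
line='<p>(openissues)81(/openissues)</p><p>(assignstoday)0(/assignstoday)</p>'

# -n suppresses the default output; the trailing p prints only lines where the
# substitution succeeded, so no separate filtering pass is needed.
# [^(]* stops the capture at the next "(" instead of running to the last marker.
open_issues=$(printf '%s\n' "$line" | sed -n 's/.*(openissues)\([^(]*\)(\/openissues).*/\1/p')
echo "$open_issues"
```

Lines that do not contain the marker produce no output at all, which is what the extra `sed -n '/openissues/p'` pass was there for.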

I'd try
VALUE=openissues
sed 's#.*('"$VALUE"')\([^(]\+\).*#\1#'
That is, replace the whole line with just the content you are searching for.
Edit: Now I see Neil's answer, which is practically the same, so accept his. I am leaving my answer up because it lets you choose which value to extract.
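The parameterised idea can be wrapped in a small function. This is only a sketch: `extract_value` is a hypothetical helper name, the marker is assumed to contain no regex metacharacters, and `[^(]*` is used in place of `[^(]\+` because `\+` is a GNU extension.

```shell
# extract_value is a hypothetical helper, not part of the original answers.
extract_value() {
  marker=$1   # plain marker name such as openissues (assumed free of regex metacharacters)
  file=$2
  # "#" is the delimiter, so the "/" in the closing marker needs no escaping.
  sed -n 's#.*('"$marker"')\([^(]*\)(/'"$marker"').*#\1#p' "$file"
}

# Temporary sample file with a line shortened from the question.
sample=$(mktemp)
printf '%s\n' '<p>(openissues)81(/openissues)</p><p>(assignsweek)2(/assignsweek)</p>' > "$sample"

open=$(extract_value openissues "$sample")
week=$(extract_value assignsweek "$sample")
echo "openissues=$open assignsweek=$week"
rm -f "$sample"
```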

Search pattern between tags in html

I need to get the value from a tag with a specific title.
I have this command:
sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html
This is part of index.html, and I need the text 'Everything in life is luck':
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q">
<img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump">
</a>
</div>
Everything in life is luck.
Donald Trump
</div>
I also need to put all of these values into an array in bash.
Your sed command is mostly good; it is just missing .* at each end of the regex to remove the text before and after the match.
This command extracts all values with your specific title:
sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html
To put into an array:
IFS=$'\n' array=( $(sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html) )
To verify your result array:
for ((i=0; i<${#array[@]}; i++)); do
  echo "${array[$i]}"
done
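As an alternative to setting IFS, bash 4 and later can fill the array with `mapfile`. The sketch below assumes bash (not plain sh) and uses hypothetical sample input in which the quote text sits on the same line as the tag, which is what the sed expression needs in order to match.

```shell
# Hypothetical sample input: one quote link per line, text on the same line as the tag.
html='<a href="/q/1" title="view quote">Everything in life is luck.</a>
<a href="/q/2" title="view quote">Second sample quote.</a>'

# mapfile -t (bash 4+) stores each output line as one array element,
# with no word splitting or glob expansion and without touching IFS.
mapfile -t quotes < <(printf '%s\n' "$html" | sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p')

# Iterate over the elements directly instead of by index.
for quote in "${quotes[@]}"; do
  printf '%s\n' "$quote"
done
```

Quoting `"${quotes[@]}"` keeps quotes that contain spaces as single elements.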

Regex - replace wrappers around strings

What would be a regex expression that could turn this:
{{ Form::label('events', 'Events') }}
Into this:
<label for="events">Events</label>
I need the strings "events" and "Events" to remain intact.
Try this:
s/.*::(.*?)\('(.*?)',\s'(.*?)'.*/<$1 for="$2">$3<\/$1>/
This would also work for your sample, this is formatted with sed:
sed -E "s#[^']+'([^']+)', '([^']+)'.*#<label for=\"\1\">\2</label>#"
If you want it in two pieces:
[^']+'([^']+)', '([^']+)'.*
<label for="\1">\2</label>
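To connect the two answers: sed has no lazy `.*?`, but the tag name can still be captured with a bounded character class. This is a sketch under the assumption that the helper name after `::` is lowercase letters only and that both arguments are single-quoted, as in the sample.

```shell
# Sample template line from the question.
template="{{ Form::label('events', 'Events') }}"

# \1 = helper name after "::" (assumed lowercase letters), \2 = first argument,
# \3 = second argument. "#" is the delimiter so "</" needs no escaping.
label=$(printf '%s\n' "$template" |
  sed -E "s#.*::([a-z]+)\('([^']+)', *'([^']+)'.*#<\1 for=\"\2\">\3</\1>#")
echo "$label"
```

For the sample line this prints `<label for="events">Events</label>`.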

How to insert an arbitrary string after pattern with sed

It must be really easy, but somehow I don't get it… I want to process an HTML-file via a bash script and insert an HTML-String into a certain node:
org.html: <div id="wrapper"></div>
MYTEXT=$(phantomjs capture.js www.somesite.com)
# MYTEXT will look something like this:
# <div id="test" style="top: -1.9%;">Something</div>
sed -i "s/\<div id=\"wrapper\"\>/\<div id=\"wrapper\"\>$MYTEXT/" org.html
I always get this error: bad flag in substitute command: 'd' which is probably because sed interprets the content of $MYTEXT as a pattern as well – which is not what I want…
By the way: Duplicating \<div id=\"wrapper\"\> is probably also not necessary?
It seems the / in the </div> part of $MYTEXT is indeed interpreted as the closing / of the sed command. You can choose another delimiter that does not appear in $MYTEXT, for instance:
sed -i "s|\<div id=\"wrapper\"\>|\<div id=\"wrapper\"\>$MYTEXT|" org.html
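If the inserted text is truly arbitrary, switching the delimiter may not be enough, since `&` and backslashes are also special in a sed replacement. The sketch below escapes them first; it assumes the text is a single line, and the variable names and the sample string are illustrative. Plain `<` and `>` are used in the pattern because they are not special in basic regular expressions, whereas GNU sed treats `\<` and `\>` as word-boundary anchors. Using `&` in the replacement also avoids repeating the matched tag.

```shell
# Illustrative replacement text containing "/", "&" and "%".
my_text='<div id="test" style="top: -1.9%;">Something & more</div>'

# Escape backslash, the chosen delimiter "|" and "&" so sed inserts them literally.
escaped=$(printf '%s' "$my_text" | sed 's/[\\|&]/\\&/g')

# "&" in the replacement stands for the matched opening tag, so it is not repeated.
result=$(printf '%s\n' '<div id="wrapper"></div>' | sed "s|<div id=\"wrapper\">|&$escaped|")
echo "$result"
```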

Perl script to search and replace multiple lines in multiple html files

I have many html files in a folder. I need to somehow remove a <div id="user-info" ...>...</div> from all of them. As far as I know I need to use a Perl script for that, but I don't know Perl to do that. Could someone get it for me?
Here is what the "bad" code looks like:
<div id="user-info" class="logged-in">
<a class="icon icon-key-delete" href="https://test.dev/login.php?0,logout=1">Log Out</a>
<a class="icon icon-user-edit" href="https://test.dev/control.php">Control Center</a>
</div> <!-- end of div id=user-info -->
Thank you in advance!
Using XML::XSH2:
for { glob '*.html' } {
    open :F html (.) ;
    delete //div[@id="user-info" and @class="logged-in"] ;
    save :b ;
}
perl -0777 -i.withdiv -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' test.html
-0777 means there is no record separator, so the whole file is slurped in at once (instead of line by line, the default for -p).
-i.withdiv means alter files in place, leaving the original with the extension .withdiv (the default for -p is to just print).
-p means run the code given with -e on each record (here the whole file, since we are slurping) and print the result.
-e expects code to run.
man perlrun or perldoc perlrun for more info.
Here's another solution, which will be slightly more familiar to people that know jquery, as the syntax is similar. This uses Mojolicious' ojo module to load up the html content into a Mojo::DOM object, transform it, and then print that transformed version:
perl -Mojo -MFile::Slurp -E 'for (@ARGV) { say x(scalar(read_file $_))->at("#user-info")->replace("")->root; }' test.html test2.html test*.html
To replace content directly:
perl -Mojo -MFile::Slurp -E 'for (@ARGV) { write_file( $_, x(scalar(read_file $_))->at("#user-info")->replace("")->root ); }' test.html
Note, this won't JUST remove the div, it will also re-write the content based on Mojo's Mojo::DOM module, so tag attributes may not be in the same order. Specifically, I saw <div id="user-info2" class="logged-in"> rewritten as <div class="logged-in" id="user-info2">.
Mojolicious requires at least perl 5.10, but beyond that there are no non-core requirements.

One more greedy sed question

I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed, wget. Example of the frame source:
<td width="25%" align="center" valign="top"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></td>
So I do this:
sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm
to get the part which looks like this:
concept_Core.jpg
to do then this:
wget --base=/some/url/concept_Core.jpg
But there is one nasty line. That line is obviously a bug in the site, or whatever else it may be; it is wrong, but I can't change it. ;)
<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>
That is, two of these "concept_Frigate16.jpg" in a line. And my script gives me
concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg
You understand why. Sed is greedy and this obviously shows up in this case.
Now the question is: how do I get rid of this corner case? That is, how do I make the match non-greedy so that it stops at the FIRST .jpg?
Use perl:
perl -pe 's/^.*htm\?(.*?jpg).*$/$1/'
You might want to consider changing:
\(.*jpg\)
into:
\([^"]*jpg\)
This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.
Use [^"] instead of . in the regular expression.
This matches any character except a double quote, so the capture cannot run past the end of the first attribute value.
sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
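The difference between the greedy and the bounded capture can be checked on a shortened version of the problem line. This is a sketch; the sample line is abbreviated from the question and the variable names are illustrative.

```shell
# Abbreviated version of the problem line: the file name appears twice.
line='<td width="25%"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top">'

# Greedy .* runs on to the last "jpg" on the line.
greedy=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\(.*jpg\).*$/\1/p')

# [^"]* cannot cross the closing quote of the first href, so the capture stops there.
bounded=$(printf '%s\n' "$line" | sed -n 's/^.*htm?\([^"]*jpg\).*$/\1/p')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

In sed's basic regular expressions the `?` after `htm` is a literal question mark, which is why it needs no escaping here.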
GNU grep can do PCRE:
grep -Po '(?<=\.htm\?).*?jpg' concept.htm