extract string of random length after a defined pattern - regex

Here is the text I want to extract info from:
<ul class="disc">
<li><strong>euro195.com</strong></li>
<li><strong>euro213.com</strong></li>
<li><strong>uk180.com</strong> <span class="red">(optimized for web surfing; no p2p downloading)</span></li>
<li><strong>us1.com</strong> <span class="red">(optimized for web surfing; no p2p downloading)</span></li>
<li><strong>us2.com</strong> <span class="red">(optimized for web surfing; no p2p downloading)</span></li>
<li>Username: <strong>user1</strong></li>
<li>Password: <strong>pswd1</strong></li>
</ul>
<div><strong><span class="green"> More servers coming.</span></strong></div>
</div><!-- .columns -->
From this text, username and password should be fetched in the following 2 ways:
1.
Username:user1
pswd:pswd1
2.
user1
pswd1
================
1.
I can only get
<li>Username: <strong>user1</strong></li>
<li>Password: <strong>pswd1</strong></li>
with the following
egrep 'Username|Password' file
or this
<li>Username: <strong>user1
<li>Password: <strong>pswd1
with the following
grep -oP 'Username:.{0,16}|Password:.{0,16}' file
but this assumes that the values always have the same length, which is not the case.
2.
Here is what I tried, in vain again:
grep -oP "(?<=(Username: \<strong\>|Password: \<strong\>))[^>]*\<" text4
Thanks a lot for your help, guys!

Not sure if it is a good job for grep, but you can use
grep -E '(Username:|Password:)' text4 | sed 's/^.*<strong>\(.*\)<\/strong>.*$/\1/'

It's better to use an html parser rather than grep.
$ grep -oP "(?<=(Username: <strong>|Password: <strong>))[^<]*" file
user1
pswd1
The lookbehind must contain exactly the spaces that appear in the input, otherwise it won't match. You also don't need to escape < or >.
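The question also asks for the labelled form (format 1), which the commands above do not print. As a rough sketch, assuming each credential sits on its own line exactly as in the sample, one sed substitution with two capture groups can produce either form. The variable names html, pattern, labelled and values are only for illustration, and the label is printed as it appears in the input (Password, not pswd).

```shell
# Three lines copied from the sample in the question.
html='<li><strong>us2.com</strong> <span class="red">(optimized for web surfing; no p2p downloading)</span></li>
<li>Username: <strong>user1</strong></li>
<li>Password: <strong>pswd1</strong></li>'

# Group 1 captures the label, group 2 the text inside <strong>...</strong>.
pattern='.*(Username|Password): *<strong>([^<]*)</strong>.*'

# Format 1: label and value.
labelled=$(printf '%s\n' "$html" | sed -nE "s#$pattern#\1:\2#p")
# Format 2: value only.
values=$(printf '%s\n' "$html" | sed -nE "s#$pattern#\2#p")

printf '%s\n' "$labelled" "$values"
```

Because the value is matched with [^<]* rather than a fixed width, its length does not matter.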

Related

Search pattern between tags in html

I need to get value from a tag with specific title.
I have this command.
sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html
This is part of index.html, and I need the text 'Everything in life is luck':
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q">
<img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump">
</a>
</div>
Everything in life is luck.
Donald Trump
</div>
And I need all these values to fill an array in bash.
Your sed command is mostly good; it is just missing .* at each end of the regex to remove the text before and after the match.
This command extracts all values with your specific title:
sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html
To put into an array:
IFS=$'\n' array=( $(sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html) )
To verify your result array:
for ((i=0;i<${#array[@]};i++)); do
echo "${array[$i]}"
done

Why is my regex failing to select the correct elements, when it works on the online regex tester

I have a number of XML files that have HTML embedded in a node. I need to capture everything that is not a tag and add some non-HTML tags (for Moodle) around the text.
I'm processing the files from the command line, using a bash script. I'm using xpath to get the content, piping through xargs to sneakily rip out newlines and then piping through sed.
Here's a sample of the tag:
xpath -q -e '/activity/page/content' page.xml|xargs
<content><h3 style=float:right><img
src=##PLUGINFILE##/consumables.png> </h3> <h3>TITLE</h3>
<p>In order to conduct an LE5 drug test you need a Druglizaer
(batch controlled) foil pouch that contains two items:</p>
<p></p> <ol> <li><span style=font-
weight:900>Druglizer Cartridge</span></li><li><span
style=font-weight:900>Druglizer Oral Fluid
Collector</span></li> </ol> <p></p></content>
On https://regex101.com/ I used \>(.*?)\< which groups the text as expected, but when I run it with sed it isn't doing any substitutions.
#!/bin/bash
# get new name string
name=$(xpath -q -e '/activity/page/name' page.xml);
en=$(echo $name|sed -e 's/<[^>]*>//g');
vi=$(echo $en|trans -brief -t vi);
cn=$(echo $en|trans -brief -t zh-CN);
mlang_name=$(echo "{mlang en}$en{mlang}{mlang vi}$vi{mlang}{mlang zh_cn}$cn{mlang}")
# xmlstarlet to update node
# get new content string
content=$(xpath -q -e '/activity/page/content' page.xml);
# \>(.*?)\<
mlang_name=$(echo $content|sed -e 's/\>(.*?)\</\{mlang en\}$1\{mlang\}\{mlang vi\}#VI#\{mlang\}\{mlang zh_cn\}#CN#\{mlang\}/g')
# xmlstarlet to update node
I need the replace to put {mlang en}TEXT{mlang} around the text.
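I ended up using perl, as it supports the non-greedy quantifier I was using. sed's regex dialects have no non-greedy .*?, GNU sed treats \< and \> as word-boundary anchors rather than literal angle brackets, and sed backreferences are written \1, not $1.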
perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'
With the above file, the full command I used was
content=$(xpath -q -e '/activity/page/content' page.xml);echo $content|xargs|sed -e 's/<|<content>//g'|sed -e 's|</content>||g' |perl -pe 's/(.*?>)(.*?)(<.*?)/$1\{mlang en\}$2\{mlang\}$3/g'|sed -e 's/{mlang en}[\ ]*{mlang}//g'|sed -e 's/<content>//g'
Which gave the following output
<h3 style=float:right><img src=##PLUGINFILE##/consumables.png></h3><h3>{mlang en}TITLE{mlang}</h3><p>{mlang en}In order to conduct an LE5 drug test you need a Druglizaer (batch controlled) foil pouch that contains two items:{mlang}</p><p></p><ol><li><span style=font-weight:900>{mlang en}Druglizer LE5 Cartridge{mlang}</span></li><li><span style=font-weight:900>{mlang en}Druglizer Oral Fluid Collector{mlang}</span></li></ol><p></p>
If there's a more elegant way feel free to let me know.
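One possibly simpler alternative that stays within sed is to replace the non-greedy match with a negated character class. This is only a sketch, assuming the text itself never contains a literal < or >. The sample string is made up to resemble the content node, and the variable names are illustrative.

```shell
# Hypothetical sample resembling the <content> body from the question.
content='<h3>TITLE</h3> <p>In order to conduct a test you need two items:</p><p></p>'

# Match the text between a ">" and the next "<" that contains at least one
# non-space character, so empty or blank gaps between tags are left alone.
wrapped=$(printf '%s\n' "$content" |
  sed -E 's/>([^<>]*[^<> ][^<>]*)</>{mlang en}\1{mlang}</g')

printf '%s\n' "$wrapped"
```

Since [^<>] can never run past the next tag, the match stops where a non-greedy .*? would, and the later pass that removes empty {mlang en} {mlang} pairs should no longer be needed.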

Sed program - deleted strings reappearing?

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.
Here is the relevant section of my source file:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
The key is that City, Region, and Post code are all on one line now.
I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:
#using the '#' symbol as delimiter instead of '/'
#remove tags
s#<.*>\(.*\)</.*>#\1#g
#remove the nbsp
s#\(&nbsp;\)*##g
#add a newline before the address (actually typing a newline in the file)
s#\(123 street\)#\
\1#g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s#\(.*\)\n\(.*\)\n\(.*\)#\1 \2 \3#g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.
I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
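To make that concrete: N appends the next raw input line, which the earlier tag-stripping command has never seen. A minimal sketch of one way around it, assuming the three address lines always follow each other directly, is to join the lines first and strip the tags afterwards. The variable names are illustrative.

```shell
# Four lines from the sample input in the question.
input='<h1 class="fn" id="myname">My Name</h1>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>'

# On the City line, N pulls the next two raw lines into the pattern space
# and the newlines become spaces. Only then are the tags stripped, so the
# appended lines are cleaned as well.
result=$(printf '%s\n' "$input" |
  sed -e '/City/{' -e 'N' -e 'N' -e 's/\n/ /g' -e '}' -e 's/<[^>]*>//g')

printf '%s\n' "$result"
```

This prints My Name on the first line and City Region 1A1 A1A on the second.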
See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.
Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.
If you have only one data block per php file, try the following (using sed)
kent$ cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
kent$ sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000

grab value between two strings with sed?

I have the following data on one line:
Go to start of metadata
<div id="page-metadata-end" class="assistive"></div>
<fieldset class="hidden parameters">
<input type="hidden" title="browsePageTreeMode" value="view">
</fieldset>
<div class="wiki-content">
<p>(openissues)81(/openissues)</p><p>(assignstoday)0(/assignstoday)</p><p>(assignsweek)2(/assignsweek)</p><p>(replyissues)6(/replyissues)</p><p>(wrapissues)26(/wrapissues)</p>
</div>
I'd like to grab the value for "openissues", for example, but I can't figure out how to retrieve it properly. One of the things I tried is the following command:
sed -n '/(assignstoday)/,/(\/assignstoday)/p' ~/test.txt
Any help?
sed 's/.*(openissues)\(.*\)(\/openissues).*/\1/' test.txt
a quick hack to possibly meet your edited requirement:
sed -n '/openissues/p' test.txt | sed 's/.*(openissues)\(.*\)(\/openissues).*/\1/'
but regexes are really not the way to go when parsing HTML.
I'd try
VALUE=openissues
sed 's#.*('"$VALUE"')\([^(]\+\).*#\1#'
that is, replace everything except the contents of what you are searching, with that content.
edit: Now I see Neil's answer, that's practically the same, accept his. I leave my answer for the customization of which value you want to extract.
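Both ideas can be combined into a small helper. This is just a sketch: get_value is a made-up name, it reads from standard input, and it assumes the key is plain letters and the value contains no opening parenthesis.

```shell
# One line of the wiki content from the question (first three values).
line='<p>(openissues)81(/openissues)</p><p>(assignstoday)0(/assignstoday)</p><p>(assignsweek)2(/assignsweek)</p>'

# get_value KEY: read stdin, print what sits between (KEY) and (/KEY).
# In a basic regex plain parentheses are literal; \( \) form the group.
get_value() {
  sed -n 's/.*('"$1"')\([^(]*\)(\/'"$1"').*/\1/p'
}

open=$(printf '%s\n' "$line" | get_value openissues)
week=$(printf '%s\n' "$line" | get_value assignsweek)
printf '%s %s\n' "$open" "$week"
```

With -n and the p flag, lines that do not contain the key print nothing, so the extra filtering pass is not needed.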

delete html comment tags using regexp

This is what my text (HTML) file looks like:
<!--
| |
| This is a dummy comment |
| please delete me |
| asap |
| |
________________________________
| -->
this is another line
in this long dummy html file...
please do not delete me
I'm trying to delete the comment using sed :
cat file.html | sed 's/.*<!--\(.*\)-->.*//g'
It doesn't work :( What am I doing wrong?
Thank you very much for your help!
patrickmdnet has the correct answer. Here it is on one line using extended regex:
cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
Here is a good resource for learning more about sed. This sed is an adaptation of one-liner #92. Note that sed has no non-greedy quantifiers, so .*? here does not make the match lazy.
http://www.catonmat.net/blog/sed-one-liners-explained-part-three/
One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
You would be better off using existing code instead of rolling your own.
http://sed.sourceforge.net/grabbag/scripts/strip_html_comments.sed
#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall <stewart.ravenhall@ukonline.co.uk>
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
(from http://sed.sourceforge.net/grabbag/scripts/)
See this link for various ways to use perl modules for removing HTML comments (using Regexp::Common, HTML::Parser, or File::Comments.) I am sure there are methods using other utilities.
http://www.perlmonks.org/?node_id=500603
I think you can do this with awk if you want. Start:
[~] $ more test.txt
<!--
An HTML style comment
-->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
Result of the awk:
[~]$ cat test.txt | awk '/<!--/ {off=1} /-->/ {off=2} /([\s\S]*)/ {if (off==0) print; if (off==2) off=0}'
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
Improving (hopefully) on the awk-based answer provided by eldarerathis --
The code below addresses the concern raised by john-jones.
In this version, the prefix leading up to the start of the html comment is preserved, as is the suffix following the close of the html comment.
$ cat some-file | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
for example
$ cat test.txt
<!--
An HTML style comment
-->
<meta charset="utf-8"> <!-- charset encoding must be within the first 1024 bytes of the document -->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
<div> <!-- start of foo -->
foo
</div> <!-- end of foo -->
<div> <!-- start of multiline comment
bar
end of multiline comment --> </div>
$ cat test.txt | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
<meta charset="utf-8">
<div>
foo
</div>
<div> </div>