RegEx to match all text between two strings that slightly alter [duplicate] - regex

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 3 years ago.
I am currently working on an AIR app and I'm trying to get a certain block of text from a website where that block of text is always between two specific strings that contain links that change from page to page.
It looks something like this:
<p>Previous Chapter <span style="float: right">Next Chapter</span></p>
.
.
_desired content_
.
.
<p>Previous Chapter <span style="float: right">Next Chapter</span></p>
*The two strings are identical
Now, I have tried several RegEx expressions but without success. I just can't get my head around Regex in general...
The last expression I've tried is: /(?<=<p><a href=\".+\">Previous Chapter<\/a> <span style=\"float: right\"><a href=\".+\">Next Chapter<\/a><\/span><\/p>)(.*)(?=<p><a href=\".+\">Previous Chapter<\/a> <span style=\"float: right\"><a href=\".+\">Next Chapter<\/a><\/span><\/p>)/gsi
but that one isn't even being recognized as a RegEx.
I would really appreciate any help with the subject.
Thanks in advance!
EDIT:
Thanks to Organis's help I managed to solve the problem. It was indeed easier and better NOT to use RegEx.
This is what I ended up doing:
text=text.split("Next Chapter<\/span><\/a><\/p>")[1].split("Previous Chapter<\/a>")[0];
text=text.substring(0,text.lastIndexOf("<p><a href"));

Do not use RegEx. Read why: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/.
Instead, extract the text between the two fixed occurrences of <span style="float: right">Next Chapter</span></a></p>, then cut off the trailing <p>Previous Chapter <a href="**changes**"> fragment.
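The same split-and-trim idea can be sketched in Python. This is not the asker's AIR code: the sample page, the href values, and the names NAV_END, NAV_LINK, NAV_START and extract_chapter are assumptions made for illustration, and the marker strings would have to be adjusted to the real markup.

```python
# Hypothetical page: the same navigation paragraph appears before and after the content.
NAV = ('<p><a href="/ch1">Previous Chapter</a> '
       '<span style="float: right"><a href="/ch3">Next Chapter</a></span></p>')
SAMPLE_PAGE = NAV + "\nchapter text\n" + NAV

# Fixed pieces of the navigation block (assumed; the hrefs in between change per page).
NAV_END = "Next Chapter</a></span></p>"
NAV_LINK = "Previous Chapter</a>"
NAV_START = "<p><a href"


def extract_chapter(page: str) -> str:
    # Keep everything after the end of the first navigation block.
    _, _, rest = page.partition(NAV_END)
    # Keep everything before the link text of the second navigation block.
    body, _, _ = rest.partition(NAV_LINK)
    # Trim the leftover '<p><a href="...">' that opens the second block.
    cut = body.rfind(NAV_START)
    return body[:cut] if cut != -1 else body


print(repr(extract_chapter(SAMPLE_PAGE)))
```

Only fixed substrings are searched for, so the changing href values never need to be matched.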

Related

What's the difference between these two regular expressions? [duplicate]

This question already has answers here:
What's the meaning of a number after a backslash in a regular expression?
(2 answers)
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
The short and immediate version of the question is: Why are these two regex different? i.e.,
href=(['"]).+?\1
vs
href=(['"]).+?['"] or href=(['"]).+?(['"])
I am practicing regex on this site and I am trying to solve this level
http://play.inginf.units.it/#/level/6
I am posting the entire content here in case the site goes down in future.
<tr>
<a href="javascript:openurl('/Xplore/accessinfo.jsp')" class="topUnderlineLinks">
PDF(3141 KB)
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
<td width="33%" ><div align="right"> Help <a href="/xpl/contactus.jsp" class="subNavLinks">Contact
Kimya ile ilgili çeþitli temel referans
<a href="http://search.epnet.com/login.asp?profile=web&defaultdb=geh"
<a href="http://iimpft.chadwyck.com/" target="_parent">International
NFPA Standartlarý
Project Gutenberg
<a href="http://proquestcombo.safaribooksonline.com/?portal=proquestcombo&uicode=istanbultek"
Scitation
dergilerin listesini görmek için bu yolu
<a href="http://www3.interscience.wiley.com/journalfinder.html"
<td width="46%"><a href="/xpl/periodicals.jsp" class="dropDownNav" accesskey="j">Journals & Magazines
<td>IEEE Xplore Demo</td>
| Alerts
<A href='/xpl/RecentCon.jsp?punumber=10417'>Evolutionary Computation, 2005. The 2005 IEEE Congress on</A><br>
Abstract
<td>View Session History</td>
<td>New Search</td>
<a href="http://web5s.silverplatter.com/webspirs/start.ws?customer=kaynak"
Türk Standartlarý
Web of Science
<a href='deneme.html#bg'>Butler Group </a>veritabanýna 31 Mart 2007 tarihine kadar deneme eriþimi alýnmýþtýr. <span class="tarih">(19.03.2007)</span>
<a href='deneme.html#ps'>Productscan</a> veritabanýna 31 Mart 2007 tarihine kadar deneme eriþimi alýnmýþtýr. <span class="tarih">(19.03.2007)</span>
I am supposed to match text like this
href="history.jsp"
That is, I need to match any href in the above text.
Now, according to Solutions, it seems like the answer for this is href=(['"]).+?\1
But if I drop that last backreference and repeat the regex group instead (I believe the parenthesized part is called a group; correct me if I am wrong), why do I get different results? That is, I get wrong results if I use href=(['"]).+?['"] or href=(['"]).+?(['"])
The backreference has to match the same thing that the capture group matched. So the first regexp will match
"abcd"
or
'abcd'
The second version doesn't link the two ends of the match, so it will match the following as well:
"abcd'
or
'abcd"
So the version with the back-reference only matches a string surrounded by the same types of quotes.
This difference is important if you have embedded quotes in a string, e.g.
some text "<div id='foo'>" more text
The version with the back-reference will match "<div id='foo'>", but the version without the back-reference will match "<div id='.
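The difference can be checked in any engine that supports back-references. Here is a small sketch using Python's re module; the variable names are only for illustration, and the input is the example string from above.

```python
import re

text = """some text "<div id='foo'>" more text"""

# Closing quote must be the same character the group captured.
with_backref = re.compile(r"""(['"]).+?\1""")
# Closing quote may be either quote character.
without_backref = re.compile(r"""(['"]).+?['"]""")

same_quotes = with_backref.search(text).group(0)
any_quotes = without_backref.search(text).group(0)

print(same_quotes)
print(any_quotes)
```

The first pattern keeps going past the embedded single quotes until it finds the matching double quote, while the second stops at the first quote of either kind.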
The regex snippet (['"]).+?\1 captures the opening quote with (...), and uses a back-reference to use it later on with \1. That means that 'xyzzy' or "plugh" will match but not 'twisty".
That's probably the correct form since, with (['"]).+?['"], it can open and close with either quote.
As an aside, there's little point capturing the groups in your latter expression, unless you're going to use them in the code somehow. If you capture both, you could check to ensure they're identical but that's probably best handled by the use of the back-reference version.
In other words, if you wanted to allow something like 'twisty", all you need is ['"].+?['"].

Get link text before specific div using regex [duplicate]

This question already has answers here:
Python/BeautifulSoup - how to remove all tags from an element?
(7 answers)
Closed 3 years ago.
I'm writing some code in order to scrape a page for a specific search result, but the main problem is using regex with Python.
Here is part of the website source:
<div class="title_block">
<div class="ttl-oss"> </div>
TEXT-TO-CATCH
</div>
The div ttl-oss appears just once in the page, so my idea is to use regex to search for that unique div and get the text of the first link after it (TEXT-TO-CATCH).
The problem is that if I use a regex like <div class="title_block">.*?(<a.*?>)+ I'm not able to find the div and get the text.
Any new approach to solving this is welcome.
Thank you
HTML is usually better handled by an HTML parser, and several are available for python. Regex in general isn't flexible enough for complicated HTML.
However, this should get the text you're looking for, assuming your page looks similar to the one you've posted as an example.
<div class="ttl-oss">[\s\S]*?<a[^>]*href.*>(.*)<\/a>
This regex looks for a div structured as you described in your example, looks for the first anchor tag it finds past that which has "href" in it, and then captures the first chunk of text after the closing >, capturing up to the closing </a> tag.
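As a rough illustration, the pattern can be applied with Python's re module like this. The anchor tag and its href in the sample are assumptions, since the question's snippet does not show the link markup.

```python
import re

# Hypothetical page fragment; the <a href="..."> wrapper is assumed.
html = """<div class="title_block">
<div class="ttl-oss"> </div>
<a href="/some/page">TEXT-TO-CATCH</a>
</div>"""

pattern = re.compile(r'<div class="ttl-oss">[\s\S]*?<a[^>]*href.*>(.*)<\/a>')

match = pattern.search(html)
# group(1) is the text between the anchor's opening tag and its closing </a>.
link_text = match.group(1) if match else None
print(link_text)
```

Because .* is greedy and does not cross line breaks here, the result depends on the link sitting on its own line; with several links on one line the capture would likely differ, which is one more reason to prefer an HTML parser.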

Regex to not match inside html anchor tag [duplicate]

This question already has answers here:
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 5 years ago.
I have a requirement not to match a specific word when it occurs inside an anchor tag. Anchor tags can have other HTML tags nested inside them.
For Example:
<a title="Test" href="http://www.google.com/"><span style="color: blue;">Test</span></a><p>Test - MANUALLY<br /><br />Google </p><p> Resolving as duplicate of Test</p><p>Test test</p>
Here every "Test" gets selected. All I want is to match only the occurrences of "Test" that are neither inside the anchor tag nor part of its attributes.
Regex I used was:
(?!<a[^>]*>)(Test)(?![^<]*<\/a>)/gi
Not sure if this will accomplish your needs, but the second capturing group should only include matches that do not fall within the anchor tag.
(<a.*?<\/a>)|(test)/gi
https://regex101.com/r/rTLifk/1
However, I would highly recommend utilizing an XML parser or XPath.
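To show how the two-group pattern is meant to be consumed, here is a sketch in Python (the question's own language may differ; the variable names are illustrative). Whole anchor elements are swallowed by group 1 and ignored, so only group 2 hits are kept.

```python
import re

html = ('<a title="Test" href="http://www.google.com/">'
        '<span style="color: blue;">Test</span></a>'
        '<p>Test - MANUALLY<br /><br />Google </p>'
        '<p> Resolving as duplicate of Test</p><p>Test test</p>')

pattern = re.compile(r'(<a.*?<\/a>)|(test)', re.IGNORECASE)

# Keep only matches where the second group took part, i.e. text outside <a>...</a>.
outside_anchor = [m.group(2) for m in pattern.finditer(html)
                  if m.group(2) is not None]
print(outside_anchor)
```

Note that `<a` also matches the start of any other tag whose name begins with "a", so this stays a sketch for simple input rather than a general solution.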

Regex to remove attribute from xhtml document? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
How to remove single attribute with quotes via RegEx
I am trying to remove the "sfref" attribute from the html code below:
<a sfref="[Libraries]719c25f9-89b3-4a7c-b6d5-e734b0c06ac1" href="../../HPLC.sflb.ashx">Determination</a> <br />
<img sfref="[Libraries]3e60aebb-acac-4806-bd22-f7986f66e7b3" src="../../Note52011.sflb.ashx">Test</a><br />
So far I have come up with this regex, but it is not matching:
(sfref=")([a-zA-Z0-9:;.\s()-\,]*)(")
This is where I am testing, if it helps:
http://regexr.com?2v4h6
Can someone please help me remove the "sfref" attribute?
You really, really, really shouldn't use regex (see the link in @Jack Maney's comment), but if you have to, this should work:
sfref="[^"]*"
This will work for single or double quotes.
sfref=('|").*?\1
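A minimal sketch of the removal using Python's re.sub, applied to the first line of the sample markup. The leading \s+ is an addition to the pattern above (an assumption about the desired output) so that the space in front of the attribute is removed as well.

```python
import re

html = ('<a sfref="[Libraries]719c25f9-89b3-4a7c-b6d5-e734b0c06ac1" '
        'href="../../HPLC.sflb.ashx">Determination</a> <br />')

# ('|") captures the opening quote; \1 requires the same quote to close the value.
# The \s+ prefix is not part of the answer's pattern; it eats the preceding space.
pattern = re.compile(r"""\s+sfref=('|").*?\1""")

cleaned = pattern.sub("", html)
print(cleaned)
```

Without the \s+ prefix the attribute is still removed, but a double space is left behind in the tag.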

ColdFusion Regex for finding empty html tags

Hey all, I'm trying to dynamically strip out some empty html tags. I'm kind of new to Regex, and it seems like the engine for coldfusion isn't as robust/similar to other regex engines (like javascript and as3).
What's the trick for building a regex that ignores spaces in coldfusion 8? So, if I build this thing out I want it to work on either of the examples below.
<p > </p>
<p> </p>
<P></p>
Any help would be greatly appreciated!
This should work, I think: <\w+[^>]*(/>|>\s*?</\w+>). It uses no complex, language-specific features (e.g. lookaheads, lookbehinds, etc.).
Modified from here: Regular expression to remove empty <span> tags
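The pattern can be tried against the three examples from the question. The sketch below uses Python's re module rather than ColdFusion, so it only shows what the expression itself matches; whether the ColdFusion 8 engine behaves identically is not verified here.

```python
import re

EMPTY_TAG = re.compile(r'<\w+[^>]*(/>|>\s*?</\w+>)')

samples = ['<p > </p>', '<p> </p>', '<P></p>', '<p>keep me</p>']

# Remove every match; tags that contain text are left untouched.
stripped = [EMPTY_TAG.sub('', s) for s in samples]
print(stripped)

# The first alternative (/>) also matches self-closing tags such as <br />.
self_closing = EMPTY_TAG.sub('', 'a<br />b')
print(self_closing)
```

Since the expression does not check that the closing tag name equals the opening one, and it removes self-closing tags too, it may need tightening if tags like <br /> or <img /> have to survive.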