I'm writing some code to scrape a page for a specific search result, but my main problem is using regex with Python.
Here is part of the website source:
<div class="title_block">
<div class="ttl-oss"> </div>
TEXT-TO-CATCH
</div>
The div ttl-oss appears just once in the page, so my idea is to use a regex to search for that unique div and get the text of the first link after it (TEXT-TO-CATCH).
The problem is that if I use a regex like <div class="title_block">.*?(<a.*?>)+, I'm not able to find the div and get the text.
Any new approach to solving this is welcome.
Thank you
HTML is usually better handled by an HTML parser, and several are available for Python. Regex in general isn't flexible enough for complicated HTML.
However, this should get the text you're looking for, assuming your page looks similar to the one you've posted as an example.
<div class="ttl-oss">[\s\S]*?<a[^>]*href.*>(.*)<\/a>
This regex finds the div from your example, then looks for the first anchor tag after it that contains "href", and captures the text between that tag's closing > and the closing </a> tag.
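As a rough illustration of the parser-based route recommended above, here is a minimal sketch using Python's standard-library html.parser. The sample markup is an assumption: the <a> element and its href are hypothetical stand-ins for whatever the real page contains, and the names FirstLinkAfterDiv and first_link_text are made up for this example.

```python
from html.parser import HTMLParser


class FirstLinkAfterDiv(HTMLParser):
    """Collect the text of the first <a> that follows <div class="ttl-oss">."""

    def __init__(self):
        super().__init__()
        self.seen_marker = False  # True once the ttl-oss div has been seen
        self.in_link = False      # True while inside the target <a>
        self.done = False         # True once the target <a> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "ttl-oss" in classes:
            self.seen_marker = True
        elif tag == "a" and self.seen_marker:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link = False
            self.done = True

    def handle_data(self, data):
        if self.in_link:
            self.parts.append(data)


def first_link_text(html):
    parser = FirstLinkAfterDiv()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Hypothetical markup: the <a> wrapper and its href are assumed.
sample = """
<div class="title_block">
<div class="ttl-oss"> </div>
<a href="/item/1">TEXT-TO-CATCH</a>
</div>
"""
print(first_link_text(sample))
```

Links that appear before the ttl-oss div are ignored, because collection only starts once that div has been seen.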
I'm trying to perform a regex replacement on the HTML below. I'm using an existing (I didn't write it and don't really understand it) regex pattern that ignores anything inside of an HTML tag, but I need it to also ignore anything between script tags. The pattern is (?<!<[^>]*)(diversity|and|inclusion). The problem is that the and in 'playerBrandingId' in the javascript is getting matched and ultimately replaced. In case it matters, I'm using C#. You can see what I get here.
<p>When it comes to building more diverse and inclusive workforces, the sports industry is already a leader, but it can do much more. One of the ways SBD/SBJ is focusing on diversity and inclusion is by talking to business leaders about what the industry can do better. In our first video in the “SBJ Diversity and Inclusion” series, we hear from execs working in leagues, technology, recruitment and academia.</p>
<div class="article-offset-block article-video article-offset-block--half">
<div class="u-vr2">
<div id='video-F17F523A70EB43ECAF54DF46144835B4'></div>
</div>
</div>
<script>
var playerParam = {
'pcode': 'poeXI63BtIsR_ugBoy3Z6X8KfiMo',
'playerBrandingId': 'video-F17F523A70EB43ECAF54DF46144835B4',
'autoplay': false,
'loop': false
};
OO.ready(function () { window.ppF17F523A70EB43ECAF54DF46144835B4 = OO.Player.create('video-F17F523A70EB43ECAF54DF46144835B4', 'w5cW9qZTE6qRRDqfBdi861XWJTXci9uE', playerParam); });
</script>
EDIT:
The pattern is generated by a user's query, so the pattern could include the word window or player which would be matched in the javascript when I change the pattern to include the \b like so: (?<!<[^>]*)\b(window|player|and)\b
Change your regex to (?<!<[^>]*)\b(diversity|and|inclusion)\b. The \b adds a test for a word boundary, forcing each alternative inside the ( and ) to match only as a whole word.
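To see what \b changes, here is a small sketch in Python rather than C#. Python's re module does not accept the variable-length lookbehind (?<!<[^>]*), so only the word-boundary part is shown; the test string is made up for illustration. It does not help with whole words such as window or player that genuinely occur in the script.

```python
import re

text = "'playerBrandingId' and diversity"

# Without \b, "and" is also found inside "playerBrandingId".
loose = re.findall(r"(diversity|and|inclusion)", text)

# With \b, only whole words match.
strict = re.findall(r"\b(diversity|and|inclusion)\b", text)

print(loose)   # ['and', 'and', 'diversity']
print(strict)  # ['and', 'diversity']
```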
EDIT:
You are trying to parse the HTML to extract the text nodes and then check them.
You should not, under any circumstances, try to parse HTML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library instead: see this page for some ways to do it, or search for extracting text nodes from HTML with .NET and C#.
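The "extract the text nodes, then check them" idea can be sketched as follows. This is an illustration in Python using the standard-library html.parser, not the .NET code the question needs; the names TextNodeReplacer and replace_in_text_nodes and the sample markup are assumptions. The sketch re-emits tags as it goes and applies the pattern only to text outside <script> and <style>. It lowercases end tags and drops doctype declarations and processing instructions, so treat it as a demonstration rather than a faithful serializer.

```python
import re
from html.parser import HTMLParser


class TextNodeReplacer(HTMLParser):
    """Re-emit HTML, substituting only in text outside <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self, pattern, repl):
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.repl = repl
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes untouched

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = self.pattern.sub(self.repl, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def replace_in_text_nodes(html, pattern, repl):
    parser = TextNodeReplacer(pattern, repl)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical input, loosely modelled on the question's markup.
sample_html = (
    "<p>diversity and inclusion</p>"
    "<script>var p = {'playerBrandingId': 'x'}; window.and = 1;</script>"
)
result = replace_in_text_nodes(
    sample_html, r"\b(diversity|and|inclusion)\b", r"<b>\1</b>"
)
print(result)
```

Because tag text never passes through the substitution, the lookbehind that guarded against matches inside tags is no longer needed.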
The answer is that you cannot do what I'm trying to do with Regex according to this.
I am currently working on an AIR app and I'm trying to get a certain block of text from a website where that block of text is always between two specific strings that contain links that change from page to page.
It looks something like this:
<p>Previous Chapter <span style="float: right">Next Chapter</span></p>
.
.
_desired content_
.
.
<p>Previous Chapter <span style="float: right">Next Chapter</span></p>
*The two strings are identical
Now, I have tried several RegEx expressions but without success. I just can't get my head around Regex in general...
The last expression I've tried is: /(?<=<p><a href=\".+\">Previous Chapter<\/a> <span style=\"float: right\"><a href=\".+\">Next Chapter<\/a><\/span><\/p>)(.*)(?=<p><a href=\".+\">Previous Chapter<\/a> <span style=\"float: right\"><a href=\".+\">Next Chapter<\/a><\/span><\/p>)/gsi
but that one isn't even being recognized as a RegEx.
I would really appreciate any help with the subject.
Thanks in advance!
EDIT:
Thanks to Organis's help I managed to solve the problem; it was indeed easier and better NOT to use RegEx.
This is what I ended up doing:
text=text.split("Next Chapter<\/span><\/a><\/p>")[1].split("Previous Chapter<\/a>")[0];
text=text.substring(0,text.lastIndexOf("<p><a href"));
Do not use RegEx. Read why: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/.
Extract the text between the two fixed occurrences of <span style="float: right">Next Chapter</span></a></p>, then cut off the trailing <p>Previous Chapter <a href="**changes**">.
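The same split-and-trim idea, sketched in Python rather than ActionScript. The function name extract_between, the marker strings, and the sample page (including its href values) are assumptions made for illustration; the real page's markup may order the closing tags differently.

```python
def extract_between(page, start_marker, end_marker, tail_prefix):
    """Return the text after start_marker and before end_marker,
    with anything from the last tail_prefix onward cut off."""
    _, found, rest = page.partition(start_marker)
    if not found:
        return None
    body, found, _ = rest.partition(end_marker)
    if not found:
        return None
    cut = body.rfind(tail_prefix)  # start of the second navigation paragraph
    return body if cut == -1 else body[:cut]


# Hypothetical navigation block; the links change from page to page.
nav = (
    '<p><a href="/ch/1">Previous Chapter</a> '
    '<span style="float: right"><a href="/ch/3">Next Chapter</a></span></p>'
)
page = nav + "\n<p>desired content</p>\n" + nav

content = extract_between(
    page,
    start_marker="Next Chapter</a></span></p>",
    end_marker="Previous Chapter</a>",
    tail_prefix="<p><a href",
)
print(content.strip())  # <p>desired content</p>
```

Only the fixed parts of the navigation block are used as markers, so the changing link targets never need to be matched.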
I have a requirement where I must not match a specific word when it occurs inside an anchor tag. Anchor tags can have other HTML tags nested in them.
For Example:
<a title="Test" href="http://www.google.com/"><span style="color: blue;">Test</span></a><p>Test - MANUALLY<br /><br />Google </p><p> Resolving as duplicate of Test</p><p>Test test</p>
Here every "Test" gets selected. All I want is to match only the "Test" occurrences that are not inside an anchor tag and not part of the anchor tag's attributes.
The regex I used was:
(?!<a[^>]*>)(Test)(?![^<]*<\/a>)/gi
Not sure if this will accomplish your needs, but the second capturing group should only include matches that do not fall within the anchor tag.
(<a.*?<\/a>)|(test)/gi
https://regex101.com/r/rTLifk/1
However, I would highly recommend utilizing an XML parser or XPath.
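For what it's worth, the alternation pattern above can be tried out in Python as a quick sketch (the original question is about C#). The first alternative swallows whole anchor elements; only matches that land in the second group are kept. The function name words_outside_anchors is made up for this example.

```python
import re

# Group 1 consumes entire <a>...</a> elements; group 2 is the word we want.
PATTERN = re.compile(r"(<a.*?</a>)|(test)", re.IGNORECASE)


def words_outside_anchors(html):
    """Return every group-2 match, i.e. matches not inside an anchor."""
    return [m.group(2) for m in PATTERN.finditer(html) if m.group(2)]


html = (
    '<a title="Test" href="http://www.google.com/">'
    '<span style="color: blue;">Test</span></a>'
    "<p>Test - MANUALLY<br /><br />Google </p>"
    "<p> Resolving as duplicate of Test</p><p>Test test</p>"
)
print(words_outside_anchors(html))  # ['Test', 'Test', 'Test', 'test']
```

The two occurrences inside the anchor (the title attribute and the span text) are consumed by the first alternative and therefore never reach group 2.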
I have some HTML which I want to grab between 2 tags. However, nested tags exist in the HTML, so simply looking for the closing tag wouldn't work, as the match would end at the first nested div.
Basically, I want my regex to:
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
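The greedy versus lazy difference described above, sketched in Python on the same input as the PHP one-liner:

```python
import re

text = "<div>Test<div>foo</div></div>"

# Greedy: .* runs as far as it can, so the match ends at the LAST </div>.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)

# Lazy: .*? stops as early as it can, so the match ends at the FIRST </div>.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)

# Lazy, stopping at the next opening <div instead.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(greedy)         # Test<div>foo</div>
print(lazy)           # Test<div>foo
print(before_nested)  # Test
```

Note that the lazy version against </div> still returns unbalanced markup, which is why neither quantifier alone handles nesting.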
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the HTML isn't valid, or I missed something, etc.
/<div id="test"[^<]*(<([^ >]+).+<\/$2>[^<]*)*<\/div>/
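By contrast, a parser can track nesting directly. Below is a minimal sketch using Python's standard-library html.parser that counts <div> depth to find where the target div really ends; the class name DivText and the helper div_text are hypothetical, and the sample is the markup from the question.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text inside the <div> with a given id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # 0 = outside the target div; 1+ = nesting level inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1  # reaches 0 when the target div itself closes

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def div_text(html, target_id):
    parser = DivText(target_id)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = """
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">after
"""
print(div_text(sample, "test"))  # ['Test', 'some info']
```

The stray extra </div> in the sample is simply ignored once the depth is back to zero, and nothing from the test2 div is collected.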
Hey all, I'm trying to dynamically strip out some empty HTML tags. I'm kind of new to regex, and it seems like the engine for ColdFusion isn't as robust as, or similar to, other regex engines (like JavaScript and AS3).
What's the trick for building a regex that ignores spaces in ColdFusion 8? If I build this thing out, I want it to work on any of the examples below.
<p > </p>
<p> </p>
<P></p>
Any help would be greatly appreciated!
This should work: <\w+[^>]*(/>|>\s*?</\w+>). I think. There are no complex, language-specific features (i.e. lookaheads, lookbehinds, etc.).
Modified from here: Regular expression to remove empty <span> tags
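A quick way to sanity-check that pattern is to run it in Python; this is only a sketch, and ColdFusion 8's engine has not been verified to behave identically. The helper name strip_empty_tags and the sample string are made up for illustration.

```python
import re

# Same pattern as above: a self-closing tag, or an opening tag followed
# only by optional whitespace and then some closing tag.
EMPTY_TAG = re.compile(r"<\w+[^>]*(/>|>\s*?</\w+>)")


def strip_empty_tags(html):
    return EMPTY_TAG.sub("", html)


sample = "<div><p > </p>keep<p> </p><P></p><p>text</p></div>"
print(strip_empty_tags(sample))  # <div>keep<p>text</p></div>
```

Two caveats follow from the pattern itself: the /> branch also removes self-closing tags such as <br/>, and the closing tag is not required to have the same name as the opening tag.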