How to make a non-greedy regex for following? - regex

I have something like this:
...
<div class="viewport viewport_h" style = "overflow: hidden;" >
<div id="THIS" class="overview overview_h">
<ul>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
</ul>
<div>
" some text to be captured"
</div>
</div>
</div>
"some text not to be captured"
</div>
<div class="scrollbar_h">
<div class="track_h"></div>
...
I want to capture everything inside the div with id=THIS. I'm using something like:
#<div class="viewport viewport_h" style = "overflow: hidden;" >\s*<div class="overview overview_h">\s*(?:<ul>)?([\s\d\w<>\/()="-:;‘’!,:]+)(?:</div>)+?#
The last (?:</div>)+? is meant to make the match non-greedy with respect to further "</div>" tags, but that doesn't work and it captures all the following </div> tags. :(

As said in the comments, regex is not a proper way to parse (?:X|H)TML documents.
Considering your example, one straightforward approach is the following regex:
<div[^>]*id="THIS"[^>]*>(.*?)</div>
With the dot set to match newlines, the match stops at the first </div>, so it covers only the following text:
<ul>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
<li>some txt to be captured</li>
</ul>
<div>
" some text to be captured"
</div>
As you can see, this is not the correct result, because the match stops at the first </div> and you need one more. To find the right closing tag you have to count the open divs, and how you do that depends on the language you are using.
In this particular case, if you want the non-greedy match to run up to the second closing div, you need to require one more </div> after the first one, like the following:
<div[^>]*id="THIS"[^>]*>(.*?</div>)\s*</div>
Now the capture group also includes the inner </div>, but it is still hard for a regex to detect the true result in general (it gets more complicated with other nesting). That is the reason the proper way to parse (?:X|H)TML is to use an (?:X|H)TML parser.
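The "count the open divs" idea can be sketched with a parser instead of a regex. The following is a minimal sketch in Python, assuming the standard-library html.parser module is acceptable; the class name DivInnerExtractor and the helper inner_html_of_div are made up for this illustration. It increments a depth counter on every nested <div> and stops when the counter returns to zero.

```python
from html.parser import HTMLParser


class DivInnerExtractor(HTMLParser):
    """Collects the markup inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means "not inside the target div"
        self.done = False   # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Inside the target: keep the tag and track nested divs.
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # This </div> closes the target itself; do not keep it.
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def inner_html_of_div(html, target_id):
    parser = DivInnerExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling inner_html_of_div(page, "THIS") returns the ul and the nested div, and nothing after the closing tag of the target div. Character references are decoded by the parser, so the returned text is not guaranteed to be byte-identical to the source.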

Related

How to remove li tags with in Particular DIV tag in notepad ++ using regex

I have content like below
<div class="content1">
<ul>
<li>line1</li>
<li>line2</li>
<li>line3</li>
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I want to strip all li tags within the content1 div and retain the contents inside them, like below:
<div class="content1">
<ul>
line1
line2
line3
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I have about 500 HTML files to edit. Is there any regex to achieve this in Notepad++?
You can use a regex like this
<li>(.*?)<\/li>
With the replacement string:
$1
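The regexes to match those tags are
<li>
<\/li>
Neither < nor > is a special character, so they need no backslash. In Notepad++ and in GNU sed, \< and \> are word-boundary assertions, so \<li\> would match the word li rather than the tag. Only the / needs escaping, and only where it is also the delimiter.
If you use a terminal, you can use the stream editor sed:
sed -e 's/<li>//g' -e 's/<\/li>//g' input.txt > output.txt
In Notepad++, I believe you can do the same with Find and Replace (Ctrl+H) in regular expression mode.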
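Note that the patterns in these answers strip every li tag in the file, while the question only wants the content1 div changed. For a batch of files, a script may be easier than an editor. Here is a minimal sketch in Python, assuming the content1 div contains no nested divs (so a lazy match up to the first </div> is enough); the function name strip_li_in_content1 is made up for this example.

```python
import re

# Lazy match from the opening content1 div to the first closing div.
# Assumption: no <div> is nested inside content1.
CONTENT1_RE = re.compile(r'(<div class="content1">)(.*?)(</div>)', re.DOTALL)
LI_TAG_RE = re.compile(r"</?li>")


def strip_li_in_content1(html):
    def _strip(match):
        opening, body, closing = match.groups()
        # Remove <li> and </li> but keep the text between them.
        return opening + LI_TAG_RE.sub("", body) + closing

    return CONTENT1_RE.sub(_strip, html)
```

To process many files, the function can be applied to the text of each file in turn and the result written back.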

Sublime Text Regex Search for alphanumeric string, not working..

I'm trying to replace a common theme used in hundreds of pages in my project:
<div id="PageTitle"> (Page title as a string) </div>
And the title varies each page. I want to replace it with
<div class="row">
<div class="col-md-12 col-sm-12">
<h3><?= $pageTitle?></h3>
</div>
</div>
I've tried searching with <div id="PageTitle">/^\w+$/</div>, and <div id="PageTitle">"^[a-zA-Z0-9_]*$"</div> with no luck. Any ideas?
You are almost there. It looks like you got the pattern from somewhere else. ^ and $ are the start and end anchors: they only match at the start and end of the input (or of a line), so they cannot match in the middle of a tag and you should get rid of them.
Next, if your page title is only going to contain alphanumeric characters (and no spaces), then \w is fine; otherwise you might want to use . instead.
<div id="PageTitle">\w+<\/div>
For a title containing any character:
<div id="PageTitle">.+?<\/div>
Hope this helps!
Try this one as well; I think it's pretty strict:
<div id="PageTitle">(?:(?!<\/div>).)+<\/div>
Or even:
<div id="PageTitle">[\s\S]*?<\/div>
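If the replacement has to run across hundreds of files, the same pattern can be used from a script. This is a minimal sketch in Python, assuming the title div never contains a nested div; the names TITLE_DIV_RE, REPLACEMENT and replace_title_div are made up for this example.

```python
import re

# [\s\S]*? lazily matches any characters, including newlines,
# up to the first closing </div>.
TITLE_DIV_RE = re.compile(r'<div id="PageTitle">[\s\S]*?</div>')

REPLACEMENT = (
    '<div class="row">\n'
    '<div class="col-md-12 col-sm-12">\n'
    '<h3><?= $pageTitle?></h3>\n'
    '</div>\n'
    '</div>'
)


def replace_title_div(source):
    # A function replacement keeps re.sub from interpreting
    # backslashes or group references in REPLACEMENT.
    return TITLE_DIV_RE.sub(lambda _match: REPLACEMENT, source)
```

Titles with spaces or punctuation are matched as well, because the pattern does not restrict the characters between the tags.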

Parse specific div from raw text using regex?

So I'm in a situation that requires parsing raw HTML data as a string, this is unavoidable unfortunately otherwise I wouldn't post this. I only need regex to match the class of a div that has an img tag as a child.
So this is the code example that I'm dealing with:
<div class="summary">
<h3>Example</h3>
<div class="explanation">
<span>This serves as an example for the site.</span>
</div>
<div class="user-details">
mheathershaw<br>
<img src="res/badge522.png"/> <span class="score">522</span>
</div>
<div class="help">
Help
</div>
</div>
And the div that I'd like to retrieve the class from is the div that contains the image. The exact capture from this example that I'd like (optimally) is user-details. The criteria for capturing it is simply if it has <img ... /> as a child.
Anyone able to help? Thanks!
You may try this,
/<div\b[^>]*\bclass="([^"]*)"[^>]*>(?:(?!<\/div>)[\s\S])*?<img\b[^>]*>(?:(?!<\/div>)[\s\S])*?<\/div>/
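As a quick check of how this pattern behaves, here is a minimal sketch in Python's re module (the question does not name a language, so Python is an assumption, and class_of_div_with_img is a made-up helper name). The tempered part (?:(?!</div>)[\s\S])*? refuses to cross a closing div, so an outer div whose first closing tag comes before any img is skipped.

```python
import re

DIV_WITH_IMG_RE = re.compile(
    r'<div\b[^>]*\bclass="([^"]*)"[^>]*>'   # opening div, capture its class
    r'(?:(?!</div>)[\s\S])*?'               # anything that is not a closing div
    r'<img\b[^>]*>'                         # the img tag
    r'(?:(?!</div>)[\s\S])*?'               # rest of the div body
    r'</div>'
)


def class_of_div_with_img(html):
    match = DIV_WITH_IMG_RE.search(html)
    return match.group(1) if match else None


sample = """<div class="summary">
<h3>Example</h3>
<div class="explanation">
<span>This serves as an example for the site.</span>
</div>
<div class="user-details">
mheathershaw<br>
<img src="res/badge522.png"/> <span class="score">522</span>
</div>
<div class="help">
Help
</div>
</div>"""

print(class_of_div_with_img(sample))  # prints: user-details
```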

How can I extract URLs from html content with ruby regexp?

Let's go directly to an example, since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with a Ruby regex.
This should give you some insight of how to do it.
https://regex101.com/r/wD4oT8/2
javascript:show\(\'(.*?)'.*?\'([^\']*)\'\) will capture the first argument as $1, last part within ' as $2, so you get what you want by substituting as $2/$1.
That's the regex part of it, and, of course, you can adjust the regex as you see fit, for example to also accept " as the quote character (javascript:show\((?:\'|\")(.*?)(?:\'|\").*?\'([^\'\"]*)(?:\'|\")\)) or to allow only calls with 3 arguments.
/yourregex/.match(yourstring) will extract the information you need.
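The pattern itself is not tied to Ruby. As an illustration of the capture-and-swap step, here is a minimal sketch in Python's re module (an assumption made only for this example; the helper name build_urls is made up). It collects every javascript:show(...) call and builds the URL from the last and first arguments.

```python
import re

# Group 1: first quoted argument (the id).
# Group 2: last quoted argument before the closing parenthesis (the host).
SHOW_RE = re.compile(r"javascript:show\('(.*?)'.*?'([^']*)'\)")


def build_urls(html):
    return [f"http://{host}/{ident}" for ident, host in SHOW_RE.findall(html)]
```

Because every quantifier is lazy or excludes quotes, each match ends at the first ') that follows a quoted argument, so several calls on one line are picked up separately.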

How to skip a particular tag and crawl other tag's text in Beautifulsoup

I am crawling a webpage using BeautifulSoup. I want to skip the content of one particular tag and get the contents of the other tags. In the code below I don't want the div tag's contents, but I couldn't solve this. Please help me.
HTML code,
<blockquote class="messagetext">
<div style="margin: 5px; float: right;">
unwanted text .....
</div>
Text..............
<a class="externalLink" rel="nofollow" target="_blank" href="#">text </a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
<a class="externalLink" rel="nofollow" target="_blank" href="#">text</a>
,text
</blockquote>
I have tried like this,
content = soup.find('blockquote',attrs={'class':'messagetext'}).text
But it is fetching unwanted text inside div tag also.
Use the clear function like this:
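from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc holds the HTML shown above
content = soup.find('blockquote', attrs={'class': 'messagetext'})
# Empty every <div> inside the blockquote so its text is not returned
for tag in content.find_all('div'):
    tag.clear()
print(content.text)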
This yields:
Text..............
text
text
text
,text