Regular Expression to exclude a String around the required String - regex

Within some HTML code:
...<div class="..."><a class="..." href="...">I need this String only</a></div>...
How do I write a regular expression (for Rainmeter, which uses Perl regex) such that:
- the required string "I need this String only" is captured in a group so it can be extracted,
- the HTML link tag <a>...</a> may be absent or present, may sit in the middle of the required string, and may occur multiple times?
My attempt:
(?siU) <div class="...">.*[>]{0,1}(.*)[</a>]{0,1}</div>
where:
.* = matches any characters except newline (the <a class ... " part)
[>]{0,1} = accepts 0 or 1 occurrences of > (up to >)
(.*) = captures my string
[</a>]{0,1} = accepts 0 or 1 occurrences of </a>
This, of course, doesn't work the way I want: the output includes the HTML link markup preceding my string.
So my question is: how do I write a better (and working) regex?

Even though I agree with the advice to use a real parser for this problem, this regular expression should solve your problem:
<div [^<>]*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>
Logic:
require <div ...> at the beginning and </div> at the end.
allow and ignore <a ...> before the matched text arbitrarily many times
allow and ignore </a> after the matched text arbitrarily many times
ignore any text before any <a ...> with [^<>]* in front of it. Using .* would also work, but then it would skip all text arbitrarily up to the last instance of <a ...> in your string.
I use [^<>]* instead of .* to match non-tag text in a protected way, since literal < and > are not allowed.
I use (?:...) to group without capturing. If that is not supported in your programming language, just use (...) instead, and adjust which match you use.
Caveat: this won't be fully general but should work for your problem as described.
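As a quick check, here is a minimal Python sketch of the pattern above. It assumes the opening-tag part is written as <div [^<>]*>, and the class and href values are made-up placeholders. Rainmeter itself is not involved, so treat it only as a way to try the expression.

```python
import re

# Assumed form of the pattern: the opening tag is matched with <div [^<>]*>.
DIV_TEXT = re.compile(r'<div [^<>]*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>')

# Hypothetical inputs: one with the link tag, one without, one with the link mid-text.
with_link = '<div class="box"><a class="lnk" href="#">I need this String only</a></div>'
without_link = '<div class="box">I need this String only</div>'
link_in_middle = '<div class="box">I need <a href="#">this</a> String only</div>'

text_with_link = DIV_TEXT.search(with_link).group(1)
text_without_link = DIV_TEXT.search(without_link).group(1)
middle_match = DIV_TEXT.search(link_in_middle)

print(text_with_link)
print(text_without_link)
print(middle_match)
```

The third input shows the limit mentioned in the caveat: text that continues after a closing </a> inside the same div is not matched by this pattern.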

Related

Regex - match every possible char and space

I want to extract data from HTML. The problem is that I can't extract two strings, one at the top and one at the bottom of my pattern.
I want to extract 23423423423 and 1234523453245, but only if the string Allan appears between them:
<h4>###### </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(##########,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But the code is horrible. Is there any way to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but none of which works properly. Do you have any ideas?
You can construct a character class that matches any character, including newlines, by using [\S\s]. Whitespace and non-whitespace characters together cover every character.
Then, your attempts were reasonably close
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for the profile, the number that comes after it, any characters before Allan, any characters before AddAnswer, and the number that comes after it. If you have single-line mode available (/s) then you can use dots instead.
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
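To compare the two variants above, here is a small Python sketch. The sample is an assumption: the Profile link line is invented to carry the first number from the question, and re.DOTALL is Python's name for single-line mode.

```python
import re

# Hypothetical sample: the Profile link line is assumed, the rest follows the question.
sample = '''<a href="/Profile/23423423423">######</a>
<h4>###### </h4> said12:49:32
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>'''

# [\S\s] matches any character, newlines included, without any flag.
class_match = re.search(r'Profile/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)', sample)

# With re.DOTALL the dot matches newlines as well.
dotall_match = re.search(r'Profile/(\d+).*Allan.*AddAnswer(\d+)', sample, re.DOTALL)

# Without the flag, the dot stops at the first newline and the search fails.
plain_match = re.search(r'Profile/(\d+).*Allan.*AddAnswer(\d+)', sample)

print(class_match.groups())
print(dotall_match.groups())
print(plain_match)
```

Both working variants are greedy, so on a page with several posts they would pair the first Profile number with the last AddAnswer number.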
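In Ruby, the m modifier makes . match newlines. (In Perl, PCRE, and most other flavors that is the job of s, and m only changes where ^ and $ match.) Note that this shorter pattern no longer requires Allan between the two numbers.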
/Profile\/(\d+).+AddAnswer(\d+)/m
Better to use a parser instead. If you must use regular expressions for whatever reason, you might get by with a tempered greedy token (shown here in free-spacing mode, so it needs the x flag):
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com
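The same tempered pattern can be tried in Python, where re.VERBOSE plays the role of free-spacing mode. This is only a sketch: the sample's Profile link line is an assumption, since the question's excerpt does not show it.

```python
import re

TEMPERED = re.compile(r'''
    Profile/(\d+)               # Profile followed by digits
    (?:(?!Allan)[\S\s])+        # any character, unless Allan starts here
    Allan                       # Allan literally
    (?:(?!AddAnswer)[\S\s])+    # same construct as above
    AddAnswer(\d+)              # AddAnswer followed by digits
''', re.VERBOSE)

# Hypothetical sample: the Profile link line is assumed, the rest follows the question.
sample = '''<a href="/Profile/23423423423">######</a>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>'''

tempered_match = TEMPERED.search(sample)
print(tempered_match.groups())
```

Each tempered token advances one character at a time and stops at the first Allan (or AddAnswer) it meets, instead of running to the last one as a greedy .* would.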

Find specfic HTML tag with poorly formed closing tag

I'm having problems refining a REGEX find/replace for a specific XHTML tag (in this example, IMG tags) that don't have /> closing tags.
Consider this two-lines example text:
<div id="newdocs-logo"><img src="../../../_DOC_DEPT/common/logo-white-250w.gif" alt="CloudPassage logo" height="38" width="251" /></div>
<p class="newdocs-indent"><img src="CSM/config-scanning.png" width="692" height="359"></p>
The following REGEX works properly only if I include a-z in capture group #3's negated character class:
(<img)(.*?)([^a-z\/])(\>)
Replacement string:
$1$2$3/$4
I have to include a-z in the character class BECAUSE if I don't, then in line 1 of the example text the REGEX continues past the properly closed IMG tag and finds the closing tag of the DIV tag. I've gone 'round in circles experimenting with look-aheads/behinds and so on but can't come up with anything better.
SO although I have a workable solution, I'm keen to learn if there's a more elegant way to do this that doesn't require a-z in the negated character class.
This is actually really simple to do with a regex. Empty tags like img are actually really regular.
Assuming that there is at least one character between img and >, this regex will work:
(<img[^>]*[^\/])>
Basically, it captures <img, then everything until the last character before the >. If that character is not /, you will get the match, and can use the replacement string: $1/>.
(If you don't get a match, then your tag is already closed properly.)
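A short Python sketch of that find/replace, run against the two example lines from the question. Python writes the backreference as \1 where the answer uses $1; the variable names are illustrative only.

```python
import re

# Same pattern as above; the slash needs no escaping in Python.
UNCLOSED_IMG = re.compile(r'(<img[^>]*[^/])>')

lines = [
    '<div id="newdocs-logo"><img src="../../../_DOC_DEPT/common/logo-white-250w.gif" '
    'alt="CloudPassage logo" height="38" width="251" /></div>',
    '<p class="newdocs-indent"><img src="CSM/config-scanning.png" width="692" height="359"></p>',
]

# Replace the captured tag body plus ">" with the same body plus "/>".
fixed = [UNCLOSED_IMG.sub(r'\1/>', line) for line in lines]

for line in fixed:
    print(line)
```

The first line is left untouched because its img tag already ends in />, and only the second line gains the closing slash.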

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match here. And some.
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
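Both suggestions can be tried in Python on the example text. In this sketch the lookahead version is used with re.findall, so the wanted piece is simply the second item of the resulting list.

```python
import re

text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"

# Extended pattern: the second group holds the text between the 2nd and 3rd <br>.
extended = re.search(r'<br>(.*?)<br>(.*?)<br>', text)
second_group = extended.group(2).strip()

# Lookahead pattern: the closing <br> is not consumed, so it can open the next match.
pieces = re.findall(r'<br>\s*(.*?)(?=\s*<br>)', text)
second_piece = pieces[1]

print(second_group)
print(pieces)
```

Because the lookahead leaves each closing <br> unconsumed, every segment between two tags shows up in the list, and picking the nth one is plain indexing.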
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.

Find a block of descriptive text inside html using regex

I'm trying to figure out how to use look-ahead to try to capture the descriptive text in an html page such as
<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (<?php ?> <%php ?> <% %>). It will also replace sequence of new line characters (multiple) with only one. <b>Allow tags</b> feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.<p></p>You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.<p></p>
<b>Known issues:</b><br />
I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.
The closest I've gotten so far is:
(([^.<]){1,500})<
Which still misses on things like periods and other characters before and after the string.
Your regex will match anything that's neither "." nor "<" 1 to 500 times, then a "<".
Assuming you want to capture everything from the itemBanner div until the very next occurrence of a closing div, you can use these elements:
<div class="itemBanner"> - explicit match
() - parenthetical wrap for referencing, e.g. match[1]
.*? - any length of characters, non-greedily (as few as possible)
<\/div> - explicit match, with escaped '/'
to form this Ruby regex:
# The m flag lets . match newlines in Ruby; the banner text starts on its own line.
item_banner_div_regex = /<div class="itemBanner">(.*?)<\/div>/m
match = item_banner_div_regex.match(html)
inside_item_banner_div = match && match[1]
Note: The exact regex will depend on the implementation you're using.
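As an illustration of how the details change between implementations, here is a rough Python equivalent. The HTML is a shortened, hypothetical version of the page from the question, with closing div tags added so that the pattern has something to stop at.

```python
import re

# Shortened, hypothetical version of the page; the closing tags are assumed.
html = '''<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text.<p></p>
</div>
</div>
</div>'''

# re.DOTALL lets the dot cross line breaks; .*? stops at the first closing div.
ITEM_BANNER = re.compile(r'<div class="itemBanner">(.*?)</div>', re.DOTALL)

match = ITEM_BANNER.search(html)
inside_item_banner_div = match.group(1).strip() if match else None
print(inside_item_banner_div)
```

The outer div is skipped because its opening tag carries a style attribute, so only the inner itemBanner div matches the explicit opening string.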

What Yahoo Pipes regex use in this case?

Do you have any ideas how to change this link in item.description in Yahoo Pipes
<img src="http://mysite.com/img/pc/image.gif" class="big" style="background-image:url(http://mysite.com/pre_big_crop/pic/pc/gallery/dd/c1/example.jpeg);" alt="" title="">
to this
<img src="http://mysite.com/pre_big_crop/pic/pc/gallery/dd/c1/example.jpeg"/>
using regex.
I don't know what variant of RegEx Pipes uses, so I'll go with the .NET variant and you can adjust for whatever syntax is needed. It should be pretty close.
Search for:
<img[^>]+url\(
([^\)]+)
\)[^>]+>
Replace with:
<img src="$1" />
Join the lines. Line 1 finds an image tag up to the url argument in the CSS style attribute. Line 2 matches the background image URL and captures it. Line 3 matches the rest of the image tag.
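Joined into one line, the search pattern can be tried in Python as a rough stand-in for Pipes. Python writes the backreference as \1 instead of $1, and the variable names here are illustrative.

```python
import re

# The three lines above joined into a single pattern.
BACKGROUND_IMG = re.compile(r'<img[^>]+url\(([^\)]+)\)[^>]+>')

description = (
    '<img src="http://mysite.com/img/pc/image.gif" class="big" '
    'style="background-image:url(http://mysite.com/pre_big_crop/pic/pc/gallery/dd/c1/example.jpeg);" '
    'alt="" title="">'
)

# Rebuild the tag around the captured background-image URL.
converted = BACKGROUND_IMG.sub(r'<img src="\1" />', description)
print(converted)
```

Because every wildcard is a negated class such as [^>]+, the match cannot run past the end of the img tag it started in.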
Here is an extremely simple regex to accomplish what you're looking for, using Perl-style regexes:
<img.*background-image:url\((.*)\);.*>
Basically, here is the breakdown on how it matches:
It will start by matching the characters "<img"
It then matches any characters, between 0 and unlimited times.
Then it matches the string "background-image:url("
Then it matches any characters, between 0 and unlimited times, which is captured into backreference #1
Then it matches the characters ");"
Then it matches any characters, between 0 and unlimited times.
Then it matches the ">" character.
Note: You should replace the items that match any characters to something more specific, depending on the application that you're using the regex. This is why I've referred to this as "extremely simple".
Then, that gets replaced with:
<img src="$1">
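To see why the note suggests something more specific than "any characters", here is a Python sketch that runs the simple pattern and a tightened variant on a line holding two images. The tightened pattern and the a.jpeg / b.jpeg inputs are assumptions made for the illustration.

```python
import re

# Hypothetical input: two images on one line.
two_images = (
    '<img style="background-image:url(a.jpeg);" alt="">'
    '<img style="background-image:url(b.jpeg);" alt="">'
)

# The simple pattern from the answer: every .* is greedy.
SIMPLE = re.compile(r'<img.*background-image:url\((.*)\);.*>')

# An assumed tightened variant: negated classes keep each match inside one tag.
SPECIFIC = re.compile(r'<img[^>]*background-image:url\(([^)]*)\);[^>]*>')

simple_result = SIMPLE.sub(r'<img src="\1">', two_images)
specific_result = SPECIFIC.sub(r'<img src="\1">', two_images)

print(simple_result)
print(specific_result)
```

With the greedy version the whole line is consumed as one match and only the last URL survives, while the tightened version rewrites each tag separately.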
Edit: Didn't see richardtallent's answer, pretty similar application just a different implementation.