Regex - match every possible char and space - regex

I want to extract data from html. The thing is, that i cant extract 2 of strings which are on the top, and on the bottom of my pattern.
I want to extract 23423423423 and 1234523453245 but only, if there is string Allan between:
<h4>###### </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(##########,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, i can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But the code is horrible. Is there any solution to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but none of wchich works properly. Do you have any ideas?

You can construct a character group to match any character including newlines by using [\S\s]. All space and non-space characters is all characters.
Then, your attempts were reasonably close
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for the profile, the number that comes after it, any characters before Allan, any characters before AddAnswer, and the number that comes after it. If you have single-line mode available (/s) then you can use dots instead.
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
demo

You can use m to specify . to match newlines.
/Profile\/(\d+).+AddAnswer(\d+)/m

Better use a parser instead. If you must use regular expressions for whatever reason, you might get along with a tempered greedy solution:
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com

Related

Regular Expression to exclude a String around the required String

In between a HTML code:
...<div class="..."><a class="..." href="...">I need this String only</a></div>...
How do I write Regular Expression (for Rainmeter which uses Perl RegEx) such that:
-required string "I need this String only" is grouped to be extracted,
-the HTML link tag <a>...</a> might be
absent or present & can be present in between the required string and multiple times as well.
My attempt:
(?siU) <div class="...">.*[>]{0,1}(.*)[</a>]{0,1}</div>
where:
.*= captures every characters except newline{<a class ... "}
[>]{0,1}= accepts 0 or 1 times presence of > {upto >}
(.*)= captures my String
[</a>]{0,1}= accepts 0 or 1 times presence of </a>
this, of course, doesn't work as I want,
This gives output with HTML linking preceding my string
so my question is
How to write a better(and working) RegEx?
Even though I agree with the advice to use a real parser for this problem, this regular expression should solve your problem:
<div [^.<>]|*>(?:[^<>]*<a [^<>]*>)*([^<>]*)(?:</a>)*</div>
Logic:
require <div ...> at the beginning and </div> at the end.
allow and ignore <a ...> before the matched text arbitrarily many times
allow and ignore </a> after the matched text arbitrarily many times
ignore any text before any <a ...> with [^<>]* in front of it. Using .* would also work, but then it would skip all text arbitrarily up to the last instance of <a ...> in your string.
I use [^<>]* instead of .* to match non-tag text in a protected way, since literal < and > are not allowed.
I use (?:...) to group without capturing. If that is not supported in your programming language, just use (...) instead, and adjust which match you use.
Caveat: this won't be fully general but should work for your problem as described.

Regular expression: Remove first match pattern in front and behind certain text

I have the following text.
<span style="color:#FF0000;">赤色</span><span style="color:#0;">|*|</span><span style="color:#0070C0;">青色</span><span style="color:#0;">|*|</span><span style="color:#00B050;">緑色</span><span style="color:#0;">|*|</span>
I need to remove any span tag that defines color for "|*|" only. That is in this case, I need to remove
<span style="color:#0;">
and
</span>
Can anyone help to do that?
Thanks in advance!
You want something like this:
<span[^>]+style="[^"]*color:[^>]+>(\|\*\|)<\/span>
This matches <span, then one or more non-> characters, then a style attribute that contains color:, then the rest of the tag, then |*|, then </span>.
You would replace with $1 or just |*|.
Here's a demo.
Note: one reason your attempt didn't work is that you escaped the |s, but not the *. You need to escape the * as \*.

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg##
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places, I just want the instances in which </a> is followed by </p>, in the same line, with either nothing or something in between them, and I want to keep what's in between them.
Try this:
<replaceregexp file="notTested.xml" match="(<)\/a(>.*?<)\/p(>)" replace="\1#\2##\3" byline="true" flags="g" />
as for, but it replaces what's between the tags with .* , i haven't seen .* in a replacement/substitution expression. probably it takes it as literals . and *.
as for </a>.*</p>, the > .* < will not work when you have multiple declerations of </a> and </p> on the same line... such as:
</a>1234abcdefg</P>abcde</a>123456789. </p> would be replaced as
<#>1234abcdefg</P>abcde</a>123456789. <##>
you need to use non greedy quantifier ?. See WiKi for the use of .*? vs .*.
Solution 1: You can try this
You store the match with parenthesis, and then replace it.
exp = new Regex(#"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference:Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2:
Regex ignore middle part of capture

how to match any string in Emacs regexp?

I'm referring to this page: http://ergoemacs.org/emacs/emacs_regex.html
which says that to capture a pattern in Emacs Regexp, you need to escape the paren like this: \(myPattern\).
It further says that the syntax for capturing a sequence of ASCII characters is [[:ascii:]]+
In my document, I'm trying to match all strings that occur between <p class="calibre3"> and </p>
So, following the syntax above, I do a replace-regexp for
<p class="calibre3">\([[:ascii:]]+\)</p>
but it finds no matches.
Suggestions?
Regexps are not good for general-purpose HTML parsing, but as paragraph tags cannot be validly nested, the following is going to be fine (provided the mark-up is valid & well-formed).
<p class="calibre3">\(.*?\)</p>
*? is the non-greedy zero-or-more repetitions operator, so it will match as little as possible -- in this case everything until the next </p> (as opposed to the greedy version, which would match everything until the final </p> in the text).
The [^<] approach is fine if it fits the data in question, but it won't work if there are other tags within the paragraphs.
You need to escape your angle brackets and I would use [^<] instead of [[:ascii]] like so:
\<p class="calibre3"\>([^<]+\)</p\>
<p class="calibre3">\([^<]\)+</p>
Source: #TooTone

Regex to exclude multiple strings

Could use some help with Regex searching with NetBeans 7.01's find function.
I'm trying to exclude multiple strings. Specifically, the target lines:
<div class="table_left">
<div class="table_right">
<div class="table_clear">
I need to match only the third and other Div classes that are not either table_left or table_right.
I've tried:
class="table_(((?!left).*)|((?!right).*))
and
class="table_(left|right){0}
I realized while pasting my first Regex line that I'm matching not right OR not left, which is returning both. What is the proper way to specify two conditions? The and operator?
The joys of searching for words that are also Boolean operators...
Try this pattern:
<div\s+class="(?!table_(left|right))[^"]+"
which wouldn't match:
<div class="table_left">
<div class="table_right">
but would match:
<div class="table_clear">
<div class="foo">
EDIT
The HT wrote:
I need to match only classes that begin with table, but are not right or left
Ah, okay, that would look like:
<div\s+class="table_(?!left|right)[^"]+"
or
<div\s+class="table(?!_left|_right)[^"]+"
as you already found yourself (but I included it in my answer for completeness sake).
A quick explanation of the pattern <div\s+class="table_(?!left|right)[^"]+":
<div # match '<div'
\s+ # match one ore more space chars
class="table_(?!left|right) # match 'class="table_' only if it is not followed by 'left' or 'right'
[^"]+ # match one or more characters other than '"'
" # match a '"'