How to match any string in Emacs regexp?

I'm referring to this page: http://ergoemacs.org/emacs/emacs_regex.html
which says that to capture a pattern in Emacs Regexp, you need to escape the paren like this: \(myPattern\).
It further says that the syntax for capturing a sequence of ASCII characters is [[:ascii:]]+
In my document, I'm trying to match all strings that occur between <p class="calibre3"> and </p>
So, following the syntax above, I do a replace-regexp for
<p class="calibre3">\([[:ascii:]]+\)</p>
but it finds no matches.
Suggestions?

Regexps are not good for general-purpose HTML parsing, but as paragraph tags cannot be validly nested, the following is going to be fine (provided the mark-up is valid & well-formed).
<p class="calibre3">\(.*?\)</p>
*? is the non-greedy zero-or-more repetitions operator, so it will match as little as possible -- in this case everything until the next </p> (as opposed to the greedy version, which would match everything until the final </p> in the text).
The [^<] approach is fine if it fits the data in question, but it won't work if there are other tags within the paragraphs.
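
The difference between the three variants can be tried outside Emacs as well. The following is a minimal sketch in Python, not Emacs Lisp: Python's re module writes capture groups with plain parentheses instead of \( \), and the sample markup is made up for illustration.

```python
import re

# Hypothetical sample: two paragraphs, the first containing a nested tag.
html = ('<p class="calibre3">first <i>one</i></p>'
        '<p class="calibre3">second</p>')

# Non-greedy: each match stops at the nearest </p>.
lazy = re.findall(r'<p class="calibre3">(.*?)</p>', html)

# Greedy: a single match runs on to the last </p> in the text.
greedy = re.findall(r'<p class="calibre3">(.*)</p>', html)

# Negated class: stops at the first "<", so the paragraph that
# contains <i> is not matched at all.
no_tags = re.findall(r'<p class="calibre3">([^<]+)</p>', html)

print(lazy)     # ['first <i>one</i>', 'second']
print(greedy)   # ['first <i>one</i></p><p class="calibre3">second']
print(no_tags)  # ['second']
```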

You need to escape your angle brackets and I would use [^<] instead of [[:ascii:]] like so:
\<p class="calibre3"\>([^<]+\)</p\>

<p class="calibre3">\([^<]\)+</p>
Source: @TooTone

Related

Regular expression: Remove first match pattern in front and behind certain text

I have the following text.
<span style="color:#FF0000;">赤色</span><span style="color:#0;">|*|</span><span style="color:#0070C0;">青色</span><span style="color:#0;">|*|</span><span style="color:#00B050;">緑色</span><span style="color:#0;">|*|</span>
I need to remove any span tag that defines color for "|*|" only. That is in this case, I need to remove
<span style="color:#0;">
and
</span>
Can anyone help to do that?
Thanks in advance!
You want something like this:
<span[^>]+style="[^"]*color:[^>]+>(\|\*\|)<\/span>
This matches <span, then one or more non-> characters, then a style attribute that contains color:, then the rest of the tag, then |*|, then </span>.
You would replace with $1 or just |*|.
Here's a demo.
Note: one reason your attempt didn't work is that you escaped the |s, but not the *. You need to escape the * as \*.
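
As a rough check of the pattern above, here is a small Python sketch. It assumes Python's re module, where the replacement is written \1 rather than $1; the sample text is a shortened copy of the one in the question.

```python
import re

text = ('<span style="color:#FF0000;">赤色</span>'
        '<span style="color:#0;">|*|</span>'
        '<span style="color:#0070C0;">青色</span>'
        '<span style="color:#0;">|*|</span>')

# Same pattern as above; "/" needs no escaping in a Python string.
pattern = re.compile(r'<span[^>]+style="[^"]*color:[^>]+>(\|\*\|)</span>')

# Keep only the captured |*| and drop the span wrapped around it.
cleaned = pattern.sub(r'\1', text)
print(cleaned)
```

Spans whose content is anything other than |*| are left untouched, because the group must follow the opening tag immediately.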

Regex - match every possible char and space

I want to extract data from HTML. The problem is that I can't extract the two strings at the top and at the bottom of my pattern.
I want to extract 23423423423 and 1234523453245, but only if the string Allan occurs between them:
<h4>###### </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(##########,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But the code is horrible. Is there any way to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but neither of them works properly. Do you have any ideas?
You can construct a character class that matches any character, including newlines, by using [\S\s]. Whitespace and non-whitespace characters together cover every character.
With that, your attempts were reasonably close:
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for the profile, the number that comes after it, any characters before Allan, any characters before AddAnswer, and the number that comes after it. If you have single-line mode available (/s) then you can use dots instead.
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
demo
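
Both variants can be compared in a few lines of Python. This is only a sketch: the page fragment is hypothetical (in particular the Profile/ link, which the sample in the question does not show), and re.DOTALL is Python's name for the s flag.

```python
import re

# Hypothetical page fragment; the Profile/ link is an assumption.
page = '''<a href="Profile/23423423423">user</a>
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>'''

# [\S\s] matches any character, newlines included, without any flag.
m1 = re.search(r'Profile/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)', page)

# With re.DOTALL, "." matches newlines as well.
m2 = re.search(r'Profile/(\d+).*Allan.*AddAnswer(\d+)', page, re.DOTALL)

# Without the flag, "." stops at line ends and the search fails.
m3 = re.search(r'Profile/(\d+).*Allan.*AddAnswer(\d+)', page)

print(m1.groups())  # ('23423423423', '1234523453245')
print(m2.groups())  # ('23423423423', '1234523453245')
print(m3)           # None
```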
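In Ruby, you can use the m flag to make . match newlines. (In most other flavors, such as PCRE, m only changes how ^ and $ behave, and s is the flag that affects the dot.) Note that the pattern below does not check for Allan.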
/Profile\/(\d+).+AddAnswer(\d+)/m
Better use a parser instead. If you must use regular expressions for whatever reason, you might get along with a tempered greedy solution:
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com
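
To illustrate what the tempered construct changes, here is a sketch in Python using re.VERBOSE so that the commented layout above can be kept. The two-post sample text is invented for the example. Note that the construct only stops at the first Allan and the first AddAnswer after it; it does not by itself keep a match inside a single post.

```python
import re

# Hypothetical text with two posts; only the first mentions Allan.
page = ('Profile/111\nHi there, Allan.\nAddAnswer222\n'
        'Profile/333\nHi Bob.\nAddAnswer444')

tempered = re.compile(r'''
    Profile/(\d+)              # Profile followed by digits
    (?:(?!Allan)[\S\s])+       # any character, unless Allan starts here
    Allan                      # Allan literally
    (?:(?!AddAnswer)[\S\s])+   # same construct, stopping before AddAnswer
    AddAnswer(\d+)             # AddAnswer followed by digits
''', re.VERBOSE)

greedy = re.compile(r'Profile/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)')

print(tempered.search(page).groups())  # ('111', '222')
print(greedy.search(page).groups())    # ('111', '444')
```

The greedy pattern runs on to the last AddAnswer in the text, while the tempered one stops at the first AddAnswer after Allan.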

Find specific HTML tag with poorly formed closing tag

I'm having problems refining a REGEX find/replace for a specific XHTML tag (in this example, IMG tags) that don't have /> closing tags.
Consider this two-line example text:
<div id="newdocs-logo"><img src="../../../_DOC_DEPT/common/logo-white-250w.gif" alt="CloudPassage logo" height="38" width="251" /></div>
<p class="newdocs-indent"><img src="CSM/config-scanning.png" width="692" height="359"></p>
The following REGEX works properly only if I include a-z in capture group #3's negated character class:
(<img)(.*?)([^a-z\/])(\>)
Replacement string:
$1$2$3/$4
I have to include a-z in the character class BECAUSE if I don't, then in line 1 of the example text the REGEX continues past the properly closed IMG tag and finds the closing tag of the DIV tag. I've gone 'round in circles experimenting with look-aheads/behinds and so on but can't come up with anything better.
So although I have a workable solution, I'm keen to learn if there's a more elegant way to do this that doesn't require a-z in the negated character class.
This is actually really simple to do with a regex. Empty tags like img are actually really regular.
Assuming that there is at least one character between img and >, this regex will work:
(<img[^>]*[^\/])>
Basically, it captures <img, then everything until the last character before the >. If that character is not /, you will get the match, and can use the replacement string: $1/>.
(If you don't get a match, then your tag is already closed properly.)
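
A quick way to check this behaviour is a short Python sketch. It assumes Python's re module (replacement written as \1 instead of $1) and uses the two example lines from the question with the first src path shortened.

```python
import re

lines = [
    '<div id="newdocs-logo"><img src="logo.gif" alt="logo" height="38" width="251" /></div>',
    '<p class="newdocs-indent"><img src="CSM/config-scanning.png" width="692" height="359"></p>',
]

# Matches an img tag only when the character before ">" is not "/".
unclosed_img = re.compile(r'(<img[^>]*[^/])>')

# Re-emit the captured tag text and close it with "/>".
fixed = [unclosed_img.sub(r'\1/>', line) for line in lines]

for line in fixed:
    print(line)
```

The first line comes back unchanged, because [^>]* cannot run past the tag's own >, so the match cannot leak into the closing div tag.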

Regular expression replace start and end, ignore middle

In an Ant build file, is there a way to use a replaceregexp to find and replace two tags, and retain what's in between them? For example, to find all of these:
</a>1234abcdefg</P>
</a>123456789. </p>
</a> yop </p>
</a></p>
and replace
</a> and </p>
with
<#> and <##>
so that I have, respectively:
<#>1234abcdefg<##>
<#>123456789. <##>
<#> yop <##>
<#><##>
I can't replace the tags individually since they occur in other places, I just want the instances in which </a> is followed by </p>, in the same line, with either nothing or something in between them, and I want to keep what's in between them.
Try this:
<replaceregexp file="notTested.xml" match="(&lt;)\/a(>.*?&lt;)\/p(>)" replace="\1#\2##\3" byline="true" flags="g" />
As for the replacement turning what's between the tags into .*: I haven't seen .* used in a replacement/substitution expression. It is probably taken as the literal characters . and *.
As for </a>.*</p>, the greedy .* will not work when a line contains more than one </a> and </p>, such as:
</a>1234abcdefg</P>abcde</a>123456789. </p> would be replaced as
<#>1234abcdefg</P>abcde</a>123456789. <##>
You need the non-greedy form, made by adding ? to the quantifier. See the wiki for the use of .*? vs .*.
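
The effect is easy to reproduce outside Ant. The sketch below uses Python rather than an Ant task, and the re.IGNORECASE flag is an assumption added because the example line mixes </P> and </p>.

```python
import re

line = '</a>1234abcdefg</P>abcde</a>123456789. </p>'

# Greedy: .* runs to the last </p> on the line, so only one pair is replaced.
greedy = re.sub(r'</a>(.*)</p>', r'<#>\1<##>', line, flags=re.IGNORECASE)

# Non-greedy: .*? stops at the nearest </p>, so every pair is replaced.
lazy = re.sub(r'</a>(.*?)</p>', r'<#>\1<##>', line, flags=re.IGNORECASE)

print(greedy)  # <#>1234abcdefg</P>abcde</a>123456789. <##>
print(lazy)    # <#>1234abcdefg<##>abcde<#>123456789. <##>
```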
Solution 1: You can try this.
You capture the middle part with parentheses, and then use it in the replacement.
exp = new Regex(@"YourtagStartRegex(bodyRegex)YourtagClosingRegex");
str = exp.Replace(str, "$1");
Reference: Replace the start and end of a string ignoring the middle with regex, how?
Or
Solution 2:
Regex ignore middle part of capture

Regular expression to remove <p> tags around elements wrapped in [...]'s

I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I run the regexp, the content is still just a variable containing text. It is not parsed as HTML yet. I know it is possible because I already got rid of the p tags around an image tag this way. So I just need a regexp to handle text that will be parsed as HTML at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches a corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
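
For readers who want to try the pattern without a PHP setup, here is a sketch of the same regex in Python. It assumes Python's re module, where re.DOTALL plays the role of the s flag and the replacement is written \1; the content string is a made-up sample modelled on the question.

```python
import re

# Hypothetical content: one wrapped shortcode block and one normal paragraph.
content = ('<p>\n[hide]\n<img src="a.png"/>\n[/hide]\n</p>\n'
           '<p>Plain paragraph.</p>')

# Group 1 is the whole [tag]...[/tag] block, group 2 the tag name,
# which \2 requires to reappear in the closing tag.
pattern = re.compile(
    r'<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>', re.DOTALL)

# Replace the wrapped block with just the shortcode block itself.
unwrapped = pattern.sub(r'\1', content)
print(unwrapped)
```

The ordinary paragraph is left alone, since its content does not start with a bracketed tag.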