I want to record a macro for Notepad++ that finds several texts inside an XML document containing some <Caption> tags and a lot of other XML tags. I want to use a regex and need a little help. I think I'm quite close.
example: <Caption>ThetextIwanttofind</Caption>
my regex: <Caption\b[^>]*>(.*?)</Caption>
The problem is the closing Caption tag. How do I rewrite my regex to get the inner text up to the closing Caption tag?
Thx for your help!
<Caption\b[^>]*>(.*?)<Caption> --> works for Caption without a closing tag
One solution would be to use :
<Caption\b[^>]*>(.*?)<\/?Caption>
^
But it's kind of ugly
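For comparison, here is a minimal Python sketch of the pattern from the question. It uses Python's re module rather than Notepad++'s engine, so it only illustrates the regex itself; the sample string and the name extract_captions are made up for this example.

```python
import re

# Pattern from the question: opening tag with optional attributes,
# a lazy capture of the inner text, then the closing tag.
CAPTION_RE = re.compile(r"<Caption\b[^>]*>(.*?)</Caption>", re.DOTALL)

def extract_captions(xml_text):
    """Return the inner text of every <Caption>...</Caption> element."""
    return CAPTION_RE.findall(xml_text)

# Hypothetical sample input, not taken from the asker's file.
sample = '<Root><Caption>ThetextIwanttofind</Caption><Caption id="2">Second</Caption></Root>'
print(extract_captions(sample))
```

The lazy (.*?) stops at the first closing tag, so two captions on the same line are captured separately.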
I need to remove some tags from a whole lot of html pages.
Lately I discovered the option of regex in Notepad++
But.. Even after hours of Googling I don't seem to get it right.
What do I need?
Example:
<p class=MsoNormal style='margin-left:19.85pt;text-indent:-19.85pt'><spanlang=NL style='font-size:11.0pt;font-family:Symbol'>·<span style='font:7.0pt "Times New Roman"'> </span></span><span lang=NL style='font-size:9.0pt;font-family:"Arial","sans-serif"'>zware uitvoering met doorzichtige vulruimte;</span></p>
I need to remove everything about styling, classes and id's. So I need to only have the clean tags without anything else.
Anyone able to help me on this one?
Kind regards
EDIT
Check an entire file via pastebin: http://pastebin.com/0tNwGUWP
I think this pattern will erase all styles in "p" and "span" tags :
((?<=<p)|(?<=<span))[^>]*(?=>)
=> how it works:
( (?<=<p) | (?<=<span) ) : a lookbehind block that makes sure the string we are looking for comes right after <p or <span
[^>]* : matches any run of characters that are not >
(?=>) : a lookahead block that makes sure the string we are looking for is followed by a > character
PS: Tested on Notepad ++
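A small Python sketch of the same lookaround pattern, assuming Python's re module handles it the way Notepad++ does for this input. The function name strip_p_span_attributes and the shortened sample are illustrative only.

```python
import re

# Everything between "<p" or "<span" and the next ">" is matched;
# the lookbehind and lookahead keep the tag name and ">" out of the match.
ATTRS_RE = re.compile(r"((?<=<p)|(?<=<span))[^>]*(?=>)")

def strip_p_span_attributes(html):
    """Remove the attribute text from <p ...> and <span ...> opening tags."""
    return ATTRS_RE.sub("", html)

# Shortened, hypothetical variant of the sample from the question.
sample = "<p class=MsoNormal style='margin-left:19.85pt'><span lang=NL>zware uitvoering</span></p>"
print(strip_p_span_attributes(sample))
```

One caveat with this approach: the lookbehind only checks for the literal prefix <p, so a tag such as <pre class=x> would be rewritten to <p>.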
If the sample you provided is representative of what you need to process, then the following quick and dirty solution will work:
Find what: [a-z]+='[^']*'
Replace with:
Find what: [a-z]+=[a-zA-Z]*
Replace with:
You must run the first one first to pick up the style='...' attributes, then run the second to pick up the unquoted class=... and lang=... attributes.
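The two passes can be tried outside Notepad++ with a short Python sketch. The function name strip_attributes_two_pass and the shortened sample are assumptions made for illustration.

```python
import re

def strip_attributes_two_pass(html):
    """Apply the two find/replace steps in order, replacing with nothing."""
    # Pass 1: single-quoted attributes such as style='...'
    html = re.sub(r"[a-z]+='[^']*'", "", html)
    # Pass 2: unquoted attributes such as class=MsoNormal or lang=NL
    html = re.sub(r"[a-z]+=[a-zA-Z]*", "", html)
    return html

# Shortened, hypothetical variant of the sample from the question.
sample = "<p class=MsoNormal style='margin-left:19.85pt'>zware uitvoering</p>"
result = strip_attributes_two_pass(sample)
print(repr(result))
```

Note that the spaces that separated the attributes are left behind inside the tag, and that both patterns would also hit any word=value text outside of tags.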
There's a good reason why other posters are saying not to parse HTML this way. You'll end up in all sorts of trouble, since regex in general cannot handle all the wonderful weirdness of HTML.
My advice is as follows.
As I see in your sample text you have only "p" and "span" tags that need to be handled. And you apparently want to remove all the styles inside them. In this case, you could consider removing everything inside those tags, leave them simple <p> or <span>.
I don't know about Notepad++ but a simple C# program can do this job quickly.
Assuming <spanlang=NL is a typo (it should be <span lang=NL), I'd do:
Find what: (<\w+)[^>]*>
Replace with: $1>
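The same find/replace pair as a Python sketch. Python writes the backreference as \1 where Notepad++ uses $1; the function name strip_all_attributes and the sample string are hypothetical.

```python
import re

def strip_all_attributes(html):
    """Keep '<' plus the tag name and drop everything up to the closing '>'."""
    # (<\w+) captures "<" and the tag name; [^>]*> swallows the rest of the tag.
    return re.sub(r"(<\w+)[^>]*>", r"\1>", html)

# Shortened, hypothetical variant of the sample from the question.
sample = "<p class=MsoNormal><span lang=NL style='font-size:9.0pt'>zware uitvoering</span></p>"
print(strip_all_attributes(sample))
```

Closing tags are untouched because \w+ cannot match the slash in </span>. The typo matters here: <spanlang=NL ...> would come out as <spanlang>, since \w+ takes the whole run of word characters as the tag name.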
If you don't mind doing a little bit of programming: HtmlAgilityPack can easily remove scripts, styles or whatever else from your XML/HTML.
Example:
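using System.Linq; // needed for Where and ToList

// Load the markup, then remove every <script> and <style> element.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style")
    .ToList()
    .ForEach(n => n.Remove());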
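If C# is not an option, the parser-based idea can also be sketched with Python's standard-library html.parser (a swapped-in library, not HtmlAgilityPack). This version targets the original request of dropping attributes rather than removing script/style elements; the names AttributeStripper and strip_attributes are hypothetical.

```python
from html.parser import HTMLParser

class AttributeStripper(HTMLParser):
    """Re-emit the document with every start tag reduced to its bare name."""

    def __init__(self):
        # Keep entity and character references as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")  # attrs are deliberately ignored

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def strip_attributes(html):
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

# Shortened, hypothetical variant of the sample from the question.
sample = "<p class=MsoNormal style='margin-left:19.85pt'><span lang=NL>zware uitvoering</span></p>"
print(strip_attributes(sample))
```

This sketch does not preserve comments, doctypes or self-closing syntax, so it would need extending before being run over whole pages.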
I'm working with a really large spreadsheet in OpenOffice and I've had to learn regular expressions to clean it up.
Right now I'm trying to remove all <span> tags and I've come up with an expression to do so:
(<span.*?>|</span>)
The problem is that OpenOffice doesn't seem to like the question mark (which should make it ungreedy), so when I try to remove the <span> tags, it removes most of my string.
Here is a sample of the data: http://pastebin.com/AKWZJJCv
What is an alternative way of removing the <span> tags that would work in OpenOffice's find and replace?
You could also try (<span[^>]*>|</span>)
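To see why the negated character class helps, here is a Python sketch comparing it with a greedy .* (used here to imitate the behavior described in the question, on the assumption that OpenOffice treats the quantifier as greedy). The sample string is made up.

```python
import re

# Hypothetical cell content with two span elements.
sample = '<span class="a">one</span> and <span>two</span>'

# Greedy: ".*" runs to the last ">" in the string, so the match
# swallows everything from the first "<span" onward.
greedy = re.sub(r"(<span.*>|</span>)", "", sample)

# Negated class: "[^>]*" cannot cross a ">", so each tag is matched on its own.
negated = re.sub(r"(<span[^>]*>|</span>)", "", sample)

print(repr(greedy))
print(repr(negated))
```

Because [^>]* never needs a lazy quantifier, it sidesteps the question-mark problem entirely.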
Give this a try:
<(\/)?span([a-zA-Z\-\="0-9 ]*)?>
Tested here.
Hello, I'm trying to match multiply nested quote blockquotes and transform them back into BBCode.
This is what I've got so far, as far as the regex is concerned.
I converted it back to HTML entities so that it displays on Stack Overflow.
<div class="quoteheader"><div class="topslice_quote">([\s\S]*?)</div></div><blockquote>([\s\S]*?)(?:</blockquote><div class="quotefooter"><div class="botslice_quote"></div></div>){2,})
I'm trying to match this
<div class="quoteheader"><div class="topslice_quote">Quote</div></div><blockquote>Outside quote is this
<div class="quoteheader"><div class="topslice_quote">Quote</div></div><blockquote>Inner quote is this</blockquote><div class="quotefooter"><div class="botslice_quote"></div></div>
</blockquote><div class="quotefooter"><div class="botslice_quote"></div></div>
to generate this
[quote]Outside quote is
this[quote]Inner quote is
this[/quote][/quote]
I'm using VBScript 5.5 Regular Expressions for this (but this isn't that important).
I really need help with the expression. I've tried using an HTML parser for this, but it turns out to be more difficult than using regex.
I'm just repeating what's said here.
Regular expressions can't match context-free languages, such as nested groups of tags. You can't match opening tags to closing tags, so parsing a block (especially a nested one) becomes impossible to do reliably.
You can certainly build a kludge to help, but there will be situations where it won't work.
Well, this is all you need to do with the parser.
Here's the pseudocode. I don't know your parser so this is the best I can offer.
First find the div tag with the quoteheader class. Get the next sibling.
That is the blockquote tag. Let's call this tag theQuote.
Get the first child of theQuote. It will be a html text item. That is the outer quote.
Get the third child of theQuote. It will be another blockquote tag. Let's call this tag theInner.
Get the first child of theInner. It will be a html text item. That is the inner quote.
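The parser approach can be sketched in Python with the standard-library html.parser. This is not the asker's VBScript environment or the parser the answer had in mind; instead of walking siblings and children, it reacts to tags as they stream by, which handles any nesting depth. The names QuoteToBBCode and to_bbcode are hypothetical.

```python
from html.parser import HTMLParser

class QuoteToBBCode(HTMLParser):
    """Turn <blockquote> nesting into [quote]...[/quote] and drop header/footer divs."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a quoteheader/quotefooter div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            css_class = dict(attrs).get("class", "")
            # Start skipping at a header/footer div; count nested divs while skipping.
            if self.skip_depth or css_class in ("quoteheader", "quotefooter"):
                self.skip_depth += 1
        elif tag == "blockquote" and not self.skip_depth:
            self.out.append("[quote]")

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "blockquote" and not self.skip_depth:
            self.out.append("[/quote]")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def to_bbcode(html):
    parser = QuoteToBBCode()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

# Condensed version of the markup from the question (line breaks removed).
header = '<div class="quoteheader"><div class="topslice_quote">Quote</div></div>'
footer = '<div class="quotefooter"><div class="botslice_quote"></div></div>'
sample = (header + "<blockquote>Outside quote is this"
          + header + "<blockquote>Inner quote is this</blockquote>" + footer
          + "</blockquote>" + footer)
print(to_bbcode(sample))
```

The depth counter is what a single regex cannot provide: it tracks how many divs are still open, so the header and footer markup is skipped no matter how deeply the quotes nest.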
Sorry, this might be a simple question, but I could not figure it out. What I need is to filter all the <a href...> and </a> strings out of an HTML text. I'm not sure which regular expression I should use. I tried the following search without any luck:
/<\shref^(>)>
what I mean here is to search for any string starting with "< href" and any string not containing '>' and finally '>'. My search code is not working. What is the correct one?
If I understand what you're looking for, it should be <a\shref[^>]*>.
Another way would be to use non-greedy matching:
/<a\shref.\{-}>
I think I got it:
/<a\shref[^>]+>
where [...] is a character set and a leading ^ inside it negates the set, so [^>]+ matches one or more characters that are not >.
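The patterns above are written for Vim's search. As a rough cross-check, the same negated-set idea can be sketched in Python, with an added alternative for the closing tag; the function name strip_anchors and the sample URL are made up for illustration.

```python
import re

# Opening <a href...> tags (anything up to the next ">") or closing </a> tags.
ANCHOR_RE = re.compile(r"<a\shref[^>]*>|</a>")

def strip_anchors(html):
    """Remove anchor tags but keep the link text."""
    return ANCHOR_RE.sub("", html)

# Hypothetical sample input.
sample = 'See <a href="http://example.com/">this page</a> for details.'
print(strip_anchors(sample))
```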
I have a string like this:
This <span class="highlight">is</span> a very "nice" day!
What should my RegEx-pattern in VB look like, to find the quotes within the tag? I want to replace it with something...
This <span class=^highlight^>is</span> a very "nice" day!
Something like <(")[^>]+> doesn't work :(
Thanks
It depends on your regex flavor, but this works for most of them:
"(?=[^<]*>)
EDIT: For anyone curious how this works. This translates into English as "Find a quote that is followed by a > before the next <".
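A quick way to check the lookahead is a Python sketch run on the string from the question (Python's re rather than VB's engine, so treat equivalent behavior as an assumption). The name replace_tag_quotes is hypothetical.

```python
import re

def replace_tag_quotes(text, replacement="^"):
    """Replace each quote that is followed by '>' before the next '<'."""
    return re.sub(r'"(?=[^<]*>)', replacement, text)

sample = 'This <span class="highlight">is</span> a very "nice" day!'
print(replace_tag_quotes(sample))
```

The quotes around "nice" are left alone because no > follows them. A literal > in the text after such a quote would fool the pattern, which is the usual limit of this trick.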
Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
If you are using VB.net you should be able to use HTMLAgilityPack.
Try this: <span class="([^"]+?)?">
This should get you the first attribute value in a tag:
<[^">]+"(?<value>[^"]*)"[^>]*>
If your intention is to replace ALL quotation marks within tags, you could use the following regular expression:
(<[^>"]*)(")([^>]*>)
That will isolate the substrings before and after your quotation mark. Note that this does not attempt to match opening and closing quotation marks. It simply matches a quotation mark within a tag.
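Since each match spans a whole tag but isolates only one quotation mark, replacing all of them takes repeated passes. A minimal Python sketch of that loop follows; the name replace_quotes_in_tags is hypothetical, and the replacement text is assumed not to contain a quotation mark itself.

```python
import re

QUOTE_IN_TAG_RE = re.compile(r'(<[^>"]*)(")([^>]*>)')

def replace_quotes_in_tags(html, replacement="^"):
    """Replace every quotation mark inside tags, one per tag per pass."""
    # The replacement must not contain '"', or the loop would never finish.
    while True:
        html, count = QUOTE_IN_TAG_RE.subn(
            lambda m: m.group(1) + replacement + m.group(3), html)
        if count == 0:
            return html

sample = 'This <span class="highlight">is</span> a very "nice" day!'
print(replace_quotes_in_tags(sample))
```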