Regex to replace &nbsp with &nbsp; - regex

I need to replace &nbsp with &nbsp; wherever the semicolon is missing.
My regex is &nbsp([^;])
My replacement is &nbsp;
My code is replacing
<ul><b> </b>&nbsp</ul>
<tr valign="top"><td width="144"> United Kingdom</td><td width="144"> Pound</td><td width="144"> Pence</td></tr>
with
<ul><b> </b>&nbsp;/ul>
<tr valign="top"><td width="144"> United Kingdom</td><td width="144"> Pound</td><td width="144"> Pence</td></tr>
It is removing the < of the closing </ul> tag.
Any help?

Try the regex with a negative lookahead:
(&nbsp)(?!;)

Use lookarounds:
(?<=&nbsp)(?=[^;])
or
(?<=&nbsp)(?!;)
and replace the zero-width match with ;

It would probably be better to capture the non-; character and add ; in front of it.
Search string:
(&nbsp)([^;])
Replace string:
\1;\2
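To see the difference between consuming the next character and only looking at it, here is a minimal sketch in Python. The thread does not say which language or tool is in use, so Python's re module and the sample string are assumptions for illustration only.

```python
import re

# Sample input (assumed): one &nbsp without a semicolon, one already correct.
html = "<ul><b> </b>&nbsp</ul> <td>&nbsp;Pound</td>"

# Original attempt: [^;] consumes the character after &nbsp, so it is lost.
broken = re.sub(r"&nbsp([^;])", "&nbsp;", html)

# Negative lookahead: checks that no ";" follows, but consumes nothing.
fixed_lookahead = re.sub(r"&nbsp(?!;)", "&nbsp;", html)

# Capture variant: keep the consumed character and put it back with \2.
fixed_capture = re.sub(r"(&nbsp)([^;])", r"\1;\2", html)

print(broken)
print(fixed_lookahead)
print(fixed_capture)
```

One difference between the two fixes: the capture variant needs a character after &nbsp, so it would not touch an &nbsp at the very end of the input, while the lookahead variant would.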

Related

Capture specific first matches in regex

I have this text and want to capture each match of the letter 'ñ' under the html href attribute. I want it to match the 'ñ' in both niño.html and niña.html, but not the ones in Niño and Niña:
<a href='niño.html'>Niño</a> <a href='niña.html'>Niña</a>
I tried this but it also matches Niño:
ñ(.*?\.html'>)+?
When replacing with n\1, it gives:
<a href='nino.html'>Nino</a> <a href='niña.html'>Niña</a>
What I would want the text to look like is:
<a href='nino.html'>Niño</a> <a href='nina.html'>Niña</a>
How can I do this?
Try this instead. It requires that the part between the ñ and .html'> does not contain a single quote, so the match cannot run from the link text into the next href:
ñ([^']*?\.html'>)+?
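A quick way to check this is a short Python sketch. Python's re module is an assumption here, since the question does not name a tool; the pattern and replacement are the ones from the thread.

```python
import re

text = "<a href='niño.html'>Niño</a> <a href='niña.html'>Niña</a>"

# [^']*? cannot cross a single quote, so a ñ in the link text can never
# reach the .html'> of the following href.
pattern = r"ñ([^']*?\.html'>)+?"

result = re.sub(pattern, r"n\1", text)
print(result)
```

The trailing +? on the group is kept only to mirror the thread; in this sketch the group matches exactly once either way.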

Regex, How to extract a delimited string and containing some special words?

From the following html script:
<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
</p>
I want to extract the following part
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
I tried this regular expression:
<font.*?font>
This extracts two separate matches, but how do I specify that I only want the one that contains []?
Thank you
One way, using Html Agility Pack:
using HtmlAgilityPack;
...
string htmlText = @"<p style=""line-height:0;text-align:left"">
...";
HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;
// Select every <font> element with a descendant text node that contains "[" followed later by "]".
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        Console.WriteLine(node.OuterHtml);
    }
}
In general, you shouldn't use regexes for HTML; there are usually much better ways to do it. However, in some isolated cases, it works perfectly fine. Assuming this is one of those cases, here's how to do it with regex.
Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.
We want to match
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.
<font .*[.*].*</font>
We also have to escape the special characters, otherwise [.*] will be mistaken for a character class.
<font .*\[.*\].*</font>
We also want to match all characters, but most of the time a . only matches non-newline characters. [\S\s] is a character class that by definition matches all characters.
<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
We finally have one last problem: this regex will match from the very first <font to the last </font>. With your HTML example, making the quantifiers lazy will not help, because the match would still begin at the first <font and run on into the second block to find the [. The best way I know of is to put a negative lookahead in front of every character consumed, so the match can never cross another <font or </font. So we replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.
<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
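The final regex can also be checked offline. Below is a minimal sketch in Python; the question itself appears to be about C#, so the use of Python's re module is an assumption made purely for illustration, and the HTML is the sample from the question.

```python
import re

html = """<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
</p>"""

# Each ((?!</?font)[\S\s])* consumes any character, newlines included,
# but stops as soon as a <font or </font would begin.
pattern = re.compile(
    r"<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>"
)

match = pattern.search(html)
print(match.group(0))
```

Only the second font block is returned, because the first one reaches its </font> without ever containing a [.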

Regex to replace HTML surrounding tag

I'm trying to replace an HTML tag with another one using Notepad++ search and replace.
I would like this:
<strong style="font-size: 1em;"><br />some text</strong>
to become this
<h3>some text</h3>
So far I have reached this:
<strong style="font-size: 1em;"\s(.*?)><br />(.*?)</strong>
and I'm not sure what to put in "Replace with". Is this OK?
<h3>$1</h3>
Thanks
Try this as the replacement pattern:
<h3>\2</h3>
You can reference capture groups (the parts between parentheses) in the regex with \n, where n is the number of the group.
The regex for matching should be:
<strong style="font-size: 1em;"\s?(.*?)><br />(.*?)</strong>
The \s should be optional, because in your HTML there is no whitespace before the closing >.
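To see why the replacement refers to group 2 rather than group 1, here is a small sketch in Python. Notepad++ uses its own regex engine, so Python's re module is only a stand-in here and details of replacement syntax may differ between the two.

```python
import re

source = '<strong style="font-size: 1em;"><br />some text</strong>'

# Group 1: anything left inside the opening tag (empty for this input).
# Group 2: the text between <br /> and </strong>.
pattern = r'<strong style="font-size: 1em;"\s?(.*?)><br />(.*?)</strong>'

groups = re.search(pattern, source).groups()
result = re.sub(pattern, r"<h3>\2</h3>", source)

print(groups)
print(result)
```

With the mandatory \s from the question, the same input does not match at all, which is why the quantifier was made optional.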

Regular Expression change text between tags

I have some code in the following layout, and I'm using TextCrawler to do a find and replace:
<a>
Name=LineA
epsium
ask
answer
line=10
color=red
</a>
<a>
Name=LineB
Color=Blue
</a>
...
Now the question is: what regular expression do I need to remove the second block of code between <a> and </a>?
<a>(\s*?Name\=LineB[\S\s]*?)</a>
It matches the whole block, <a> and </a> tags included, whose content starts with the text Name=LineB. Replace the match with an empty string to remove the block.
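As a rough check of that pattern, here is a sketch in Python. TextCrawler has its own regex engine, so Python's re module is an assumption used only to illustrate the match; the sample text is the one from the question.

```python
import re

text = """<a>
Name=LineA
epsium
ask
answer
line=10
color=red
</a>
<a>
Name=LineB
Color=Blue
</a>
"""

# \s*? allows only whitespace between <a> and Name=LineB, so the first
# block (which starts with Name=LineA) can never be part of the match.
pattern = r"<a>(\s*?Name\=LineB[\S\s]*?)</a>"

result = re.sub(pattern, "", text)
print(result)
```

The removed block leaves its trailing newline behind; whether that matters depends on the file format.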
In Perl, I'd do:
$str =~ s~^(.*?<a>.*?</a>.*?)<a>.*?</a>(.*)$~${1}New text$2~s;
The first group contains everything before the second <a></a> block, and the second group everything after it.
In PHP (note the s modifier, so that . also matches newlines, and the single-quoted replacement, so that PHP does not try to interpolate ${1}):
$str = preg_replace('~^(.*?<a>.*?</a>.*?)<a>.*?</a>(.*)$~s', '${1}New text$2', $str);
preg_replace('/<body>([\s\S]*)<\/body>/', $replace, $original);
This replaces everything from <body> to </body>, tags included.

Regular Expression doesn't match

I've got a string with very unclean HTML. Before I parse it, I want to convert this:
<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>
into NE DEK 143 so it is a bit easier to parse. I've got this regular expression (RegexKitLite), but it doesn't match:
NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>"
withString:@"$1 $3 $5"];
I'm not an expert in regex. Can someone help me out here?
Regards, dodo
Amarghosh and bobince (the winning answerer of the linked question) are generally right about this. However, since you are just sanitising, regexps are actually just fine.
First, strip the tags:
s/<.*?>//g
Then collapse all extra spaces into one:
s/\s+/ /g
Then remove leading/trailing space:
s/^\s+|\s+$//g
Then get the values:
^([^ ]+) ([^ ]+) ([^ ]+)$
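The four steps above can be sketched as follows. This is Python rather than the Objective-C of the question, chosen only as a convenient way to show the substitutions; re.sub replaces every occurrence by default, which corresponds to a global substitution.

```python
import re

html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""

text = re.sub(r"<.*?>", "", html)       # 1. strip the tags
text = re.sub(r"\s+", " ", text)        # 2. collapse whitespace runs into one space
text = re.sub(r"^\s+|\s+$", "", text)   # 3. remove leading/trailing space

# 4. pick out the three space-separated values
values = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", text).groups()

print(text)
print(values)
```

Step 1 relies on no tag being split across lines, which holds for the sample in the question.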
I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slash looks like it's escaped unnecessarily, etc.
But in your example, the text you're trying to extract is characterized by sitting on lines that contain no tags.
So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches.
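A minimal sketch of that line-based search, again in Python as an assumed stand-in for RegexKitLite. The pattern below uses a + quantifier so that it matches whole lines rather than single characters.

```python
import re

html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""

# (?m) makes ^ and $ match at line boundaries; the character class
# rejects any line that contains a < or >.
lines = re.findall(r"(?m)^[^<>\r\n]+$", html)

print(" ".join(lines))
```

This only works while every value sits on a line of its own, as it does in the sample.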
If you are sure of your HTML hierarchy, then you can just extract the text enclosed by font tags:
// C# example
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();
The result is the text enclosed by the font tags, with whitespace trimmed from both edges.