Regular Expression doesn't match - regex

I've got a string with very unclean HTML. Before I parse it, I want to convert this:
<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>
into NE DEK 143 so that it is a bit easier to parse. I've got this regular expression (RegexKitLite):
NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>"
withString:@"$1 $3 $5"];
I'm not an expert in regex. Can someone help me out here?
Regards, dodo

Amarghosh and bobince (who wrote the winning answer to the linked question) are generally right about this. However, since you are just sanitising, regexes are actually just fine.
First, strip the tags (globally, so that every tag is removed):
s/<.*?>//g
Then collapse each run of whitespace into a single space:
s/\s+/ /g
Then remove leading/trailing space:
s/^\s+|\s+$//g
Then get the values:
^([^ ]+) ([^ ]+) ([^ ]+)$
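For what it's worth, here is roughly how those four steps could look in Python. This is only a sketch: the names `raw_html`, `sanitise`, `cleaned` and `values` are made up for illustration, the sample is the HTML from the question, and it assumes every tag sits on a single line (otherwise the tag pattern would need the DOTALL flag).

```python
import re

# Sample input copied from the question.
raw_html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""


def sanitise(html: str) -> str:
    """Strip tags, collapse whitespace runs, trim the ends."""
    text = re.sub(r"<.*?>", "", html)  # step 1: strip the tags
    text = re.sub(r"\s+", " ", text)   # step 2: collapse whitespace
    return text.strip()                # step 3: trim leading/trailing space


cleaned = sanitise(raw_html)

# Step 4: pull out the three space-separated values.
match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", cleaned)
values = match.groups() if match else ()

print(cleaned)
print(values)
```

`re.sub` replaces every occurrence by default, which plays the role of a global flag on the substitutions.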

I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slash looks like it is escaped unnecessarily, and so on.
However, in your example, the text you are trying to extract is characterized by not being surrounded by tags.
So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches. (The + matters: without a quantifier the pattern would only match lines that are exactly one character long.)
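A quick way to try this idea, sketched in Python rather than RegexKitLite. The pattern here uses a `+` quantifier so that whole lines are matched, and the names `snippet` and `tag_free_lines` are just placeholders; the sample is a shortened version of the HTML from the question.

```python
import re

# Shortened sample in the same shape as the question's HTML.
snippet = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
</TR></TABLE>"""

# (?m) makes ^ and $ match at every line start/end, so this finds
# each non-empty line that contains neither < nor >.
tag_free_lines = re.findall(r"(?m)^[^<>\r\n]+$", snippet)

print(tag_free_lines)
```

This only works while each value sits on a line of its own, as it does in the sample.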

If you are sure of your HTML hierarchy, you can just extract the text enclosed by font tags:
// C# example
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
string result = "";
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();
The result is the text enclosed by the font tags, with leading and trailing whitespace trimmed.
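The same font-tag pattern can be tried in Python, where a named group is written `(?P<name>...)`. This is a sketch under the same assumption as above (the hierarchy is known and the font tags contain no nested tags); `FONT_RE`, `sample` and `texts` are illustrative names only.

```python
import re

# Same pattern as the C# example, with Python's named-group syntax.
FONT_RE = re.compile(
    r"<\s*font((\s+[^<>]*)|(\s*))>(?P<desiredText>[^<>]*)<\s*/\s*font\s*>"
)

sample = """<TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>"""

# Collect the trimmed text of every font element.
texts = [m.group("desiredText").strip() for m in FONT_RE.finditer(sample)]

print(" ".join(texts))
```

Collecting the pieces in a list and joining them with a space gives the "NE DEK" shape the question asks for, rather than gluing the values together.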

Related

Regexp: How can I clean html tags of attributes but a few in Notepad++

Okay, in Notepad++ I want to clean html tags with lots of style attributes, e.g.
<td style="width: 457.4pt; border: solid windowtext 1.0pt; background: #BFBFBF; padding: 0cm 5.4pt 0cm 5.4pt;" colspan="2" valign="top" border="1" cellspacing="0" cellpadding="0">
and I want this as an outcome:
<td colspan="2" valign="top">
So far I am at this for search and replace:
<([a-z][A-Z]*)[^>]*?>
<$1>
which cleans all attributes of a html tag. But I want to keep colspan and valign. How do I have to modify the expression?
This can be error prone, but for the example string you might use a branch reset group to use 2 capture groups in different order and capture the name of the tag also in a group.
Find what
<(\w+)[^<>]*(?|(\bcolspan="[^"]*")[^<>]*(\bvalign="[^"]*")|(\bvalign="[^"]*")[^<>]*(\bcolspan="[^"]*"))[^<>]*>
Replace with
<$1 $2 $3>
Explanation
< Match literally
(\w+) Capture 1+ word chars in group 1
[^<>]* Optionally match any char except < and > (Assuming no nested tag etc..)
(?| Branch reset group
(\bcolspan="[^"]*")[^<>]*(\bvalign="[^"]*") Capture colspan first, then valign
| Or
(\bvalign="[^"]*")[^<>]*(\bcolspan="[^"]*") Or capture both the other way around
) Close group
[^<>]* Optionally match any char except < and >
> Match literally
Assuming both colspan and valign are present, maybe:
<td([^<>]*?(\h(?:colspan|valign)="\w+"))((?1))[^<>]*>
Replace with <td$2$3>, see an online demo.
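If doing this outside Notepad++ is an option, the whitelist idea is easier to express with a replacement function, and the attributes do not need to appear in a fixed order or both be present. Below is a rough Python sketch (as far as I know, Python's built-in `re` module does not support branch reset groups, so it takes a different route). `KEEP`, `clean_tag` and `clean_html` are hypothetical names, and it assumes double-quoted attribute values that contain no `<` or `>`.

```python
import re

KEEP = ("colspan", "valign")  # attributes to keep (assumed whitelist)

TAG_RE = re.compile(r"<(\w+)([^<>]*)>")           # opening tag: name + attribute text
ATTR_RE = re.compile(r'\b([\w-]+)="([^"]*)"')     # name="value" pairs


def clean_tag(match: re.Match) -> str:
    """Rebuild one opening tag with only the whitelisted attributes."""
    name, attrs = match.group(1), match.group(2)
    kept = [f'{k}="{v}"' for k, v in ATTR_RE.findall(attrs) if k.lower() in KEEP]
    return "<" + " ".join([name] + kept) + ">"


def clean_html(html: str) -> str:
    # Closing tags such as </td> do not match TAG_RE and are left untouched.
    return TAG_RE.sub(clean_tag, html)


sample = ('<td style="width: 457.4pt; border: solid windowtext 1.0pt; '
          'background: #BFBFBF; padding: 0cm 5.4pt 0cm 5.4pt;" colspan="2" '
          'valign="top" border="1" cellspacing="0" cellpadding="0">')

print(clean_html(sample))
```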

How to make a regex expression for this content?

I am scraping a website which looks like this, and I am looking for 4 / 5 and 3 / 10. That is, I want (number) + space + slash + 3 spaces + another number.
I tried the regex expression ^[0-9]+(\/[0-9]+)" *"*$ but that did not work.
<td>Monday</td>
<td class="text-center text-danger font-weight-bold">4 / 5</td>
</td>
<td>Tuesday</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
</td>
You were close. Use word boundary \b instead of ^ and $, because the text you are looking for is somewhere in the middle of your text. This regex should work:
/\b[0-9]+ +\/ +[0-9]+\b/
The + makes the regex more forgiving, by requiring at least one space.
If you want to capture the numbers separately you can introduce capture groups, to reference them with $1 and $2, respectively:
/\b([0-9]+) +\/ +([0-9]+)\b/
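Since you are scraping, here is a small Python sketch of the capture-group version. The names `page`, `SCORE_RE` and `scores` are only for illustration, and the sample is the markup from the question.

```python
import re

# Markup copied from the question.
page = """<td>Monday</td>
<td class="text-center text-danger font-weight-bold">4 / 5</td>
</td>
<td>Tuesday</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
</td>"""

# number, one or more spaces, slash, one or more spaces, number
SCORE_RE = re.compile(r"\b([0-9]+) +/ +([0-9]+)\b")

# findall returns one (group 1, group 2) tuple per match.
scores = [(int(a), int(b)) for a, b in SCORE_RE.findall(page)]

print(scores)
```

The slash does not need escaping here because Python patterns are not delimited by `/`.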
$ grep "[[:digit:]]\{1,\} \/ [[:digit:]]\{1,\}" filename
<td class="text-center text-danger font-weight-bold">4 / 5</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
$
You started the regex with ^, which anchors the match to the beginning of the line, so it cannot match text in the middle of a line.

Regex, How to extract a delimited string and containing some special words?

From the following html script:
<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[designation]
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
</p>
I want to extract the following part
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
I tried this regular expression :
<font.*?font>
This extracts two separate matches, but how do I specify that I want the one that contains []?
Thank you
The way with Html Agility Pack:
using HtmlAgilityPack;
...
string htmlText = @"<p style=""line-height:0;text-align:left"">
...";
HtmlDocument html = new HtmlDocument();
html.LoadHtml(htmlText);
HtmlNode doc = html.DocumentNode;
HtmlNodeCollection nodes = doc.SelectNodes("//font[.//text()[contains(substring-after(., '['), ']')]]");
if (nodes != null)
{
foreach (HtmlNode node in nodes)
{
Console.WriteLine(node.OuterHtml);
}
}
In general, you shouldn't use regexes for HTML—there are generally many much better ways to do it. However, in some isolated cases, it works perfectly fine. Assuming this is one of those cases, here's how to do it with regex.
Making regexes is often easy when you think of it this way: write what you want to match, and then replace parts of it with regex as necessary.
We want to match
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
We don't care what face="AR BLANCA"> <span style="font-size:20pt;line-height:30px;">, désignation, and </span> are, so replace them with .*.
<font .*[.*].*</font>
We also have to make sure that you escape all the special characters, otherwise [.*] will be mistaken for a character class.
<font .*\[.*\].*</font>
We also want to match all characters, but most of the time a . only matches non-newline characters. [\S\s] is a character class that by definition matches all characters.
<font [\S\s]*\[[\S\s]*\][\S\s]*</font>
We finally have one last problem—this regex will match from the very first <font to the last </font>. With your HTML example, making the quantifier lazy will not help it, so we need to do something else. The best way to do this that I know of is to use the trick explained here. So we replace each instance of [\S\s]* with ((?!</?font)[\S\s])*.
<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>
Here's an online demonstration of this regex.
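To check the final pattern locally, here is a minimal Python sketch that runs it against the HTML from the question. `FONT_WITH_BRACKETS`, `html` and `m` are illustrative names, and the pattern is the one built up above, unchanged.

```python
import re

html = """<p style="line-height:0;text-align:left">
<font face="Arial">
<span style="font-size:10pt;line-height:15px;">
<br />
</span>
</font>
</p>
<p style="line-height:0;text-align:left">
<font face="AR BLANCA">
<span style="font-size:20pt;line-height:30px;">
[désignation]
</span>
</font>
</p>"""

# ((?!</?font)[\S\s])* consumes one character at a time, but refuses to
# step onto the start of another <font or </font tag, so a match can
# never run across a font-tag boundary.
FONT_WITH_BRACKETS = re.compile(
    r"<font ((?!</?font)[\S\s])*\[((?!</?font)[\S\s])*\]((?!</?font)[\S\s])*</font>"
)

m = FONT_WITH_BRACKETS.search(html)
if m:
    print(m.group(0))
```

The first font element is skipped because the pattern cannot reach a `[` without crossing its closing `</font>`.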

Remove all tabs, blank/break/new lines, empty lines, multiple successive spaces except single space character

I have HTML code that looks like this:
<TABLE>
<TR>
<TD>Item</TD>
<TD><A>48</A>
</TD></TR>
<TR>
<TD>Item</TD>
<TD><A >48</A>
</TD></TR>
<TR>
<TD>Tags</TD>
<TD><A>
keyword</A>, <A>keyword
</A>, <A>keyword
</A>, <A>keyword</A>, <A
>keyword</A>, <A
>keyword
</A>, <A>keyword
</A>
</TABLE>
Using .NET regex, can anyone help me to remove ALL whitespace characters EXCEPT single space characters from it so that I will end up with one long string of code?
You can use this regex:
[\p{Z}\s]{2,}
This will check if there are at least 2 whitespace characters. Replace with empty string if found.
\p{Z} stands for All Separators Unicode shorthand class.
This is possible with the following regex,
\s{2,} // \s matches any whitespace character, and {2,} requires at least two in a row
You can use it in C# like this:
string output = Regex.Replace(input, @"\s{2,}", "");
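To see what this replacement does and does not remove, here is a small Python sketch of the same `\s{2,}` substitution. Python's built-in `re` module has no `\p{Z}` class, so only the `\s` variant is shown; `squeeze`, `sample` and `result` are made-up names, and the indentation in the sample is assumed for illustration.

```python
import re


def squeeze(text: str) -> str:
    """Delete every run of two or more whitespace characters."""
    return re.sub(r"\s{2,}", "", text)


# Newline plus indentation forms a run of 2+ whitespace characters.
sample = "<TR>\n    <TD>Tags</TD>\n    <TD><A>\n        keyword</A>, <A>keyword</A>"

result = squeeze(sample)
print(result)

# A lone newline or tab is a run of length one, so it is left in place.
print(repr(squeeze("a\nb")))
```

Single spaces, such as the one after the comma, survive. A newline that is not followed by indentation survives too, so input without indentation would need an extra step to end up on one line.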

Regex to replace &nbsp with &nbsp;

I need to replace &nbsp (without the trailing semicolon) with &nbsp;
My regex is &nbsp([^;])
My replacement is &nbsp;
My code is replacing
<pre><code>
<ul><b> </b>&nbsp</ul>
<tr valign="top"><td width="144"> United Kingdom</td><td width="144"> Pound</td><td width="144"> Pence</td></tr></pre></code>
with
<pre><code>
<ul><b> </b>&nbsp;/ul>
<tr valign="top"><td width="144"> United Kingdom</td><td width="144"> Pound</td><td width="144"> Pence</td></tr>
It is removing the < of the closing </ul> tag.
Any help?
try the regex with a lookahead
(&nbsp)(?!;)
Use lookaround
(?<=&nbsp)(?=[^;])
OR
(?<=&nbsp)(?!;)
replace it with ;
It would probably be better to capture the non-; character and add ; in front of it.
Search string:
(&nbsp)([^;])
Replace string:
\1;\2
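The lookahead variant can be checked with a short Python sketch. `fix_nbsp`, `before` and `after` are illustrative names, and the sample assumes the first entity in the input already had its semicolon.

```python
import re


def fix_nbsp(html: str) -> str:
    """Add the missing semicolon to every &nbsp not already followed by one."""
    # (?!;) only looks ahead; it does not consume the next character,
    # so the < of </ul> is left in place.
    return re.sub(r"&nbsp(?!;)", "&nbsp;", html)


before = "<ul><b>&nbsp;</b>&nbsp</ul>"
after = fix_nbsp(before)

print(after)
```

Because the lookahead consumes nothing, this also handles an `&nbsp` at the very end of the string, which the `(&nbsp)([^;])` version would miss since it needs a following character to capture.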