Notepad++ search & replace - regex

I'm trying to convert an HTML file with 100 entries like this one:
<table>
<tr>
<td valign="top" width="30">
1.</td>
<td>
TEXT DESCRIPTION
</td>
</tr>
</table>
<table><tr><td></td></tr></table>
where the number "1." goes from 1 to 100, into this:
<li>
TEXT DESCRIPTION
</li>
I haven't found a way to do this, either with regexp or with extended search mode. Any ideas?

You could start with this:
Replace
.*<td>(.*[A-Za-z]+.*)<\/td>.*
with
<li>\1</li>
This will match one chunk of code of the form you posted. You will have to modify it to match multiple chunks of the same form in the same file.
Moreover, to work correctly the pattern should match lazily rather than greedily. Does anyone know how to do that?
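On the lazy-matching question: appending ? to a quantifier (.*? instead of .*) makes it stop at the first possible closing tag rather than the last one. The sketch below shows the idea in Python rather than in Notepad++ itself; the second entry and the names ENTRY and to_list_items are made up for illustration, and the pattern assumes every entry has exactly the layout shown in the question. Notepad++'s regex mode should accept the same .*? syntax, provided ". matches newline" is enabled.

```python
import re

# Two entries in the layout from the question (the second one is invented).
html = """<table>
<tr>
<td valign="top" width="30">
1.</td>
<td>
TEXT DESCRIPTION
</td>
</tr>
</table>
<table><tr><td></td></tr></table>
<table>
<tr>
<td valign="top" width="30">
2.</td>
<td>
ANOTHER DESCRIPTION
</td>
</tr>
</table>
<table><tr><td></td></tr></table>
"""

ENTRY = re.compile(
    r"<table>\s*<tr>\s*<td[^>]*>\s*\d+\.\s*</td>\s*"  # the numbered cell
    r"<td>\s*(.*?)\s*</td>\s*</tr>\s*</table>\s*"      # description, matched lazily
    r"<table><tr><td></td></tr></table>",              # the empty spacer table
    re.DOTALL,                                          # let . cross line breaks
)

def to_list_items(text: str) -> str:
    # A function replacement avoids any backslash handling in the description.
    return ENTRY.sub(lambda m: "<li>\n" + m.group(1) + "\n</li>", text)

result = to_list_items(html)
print(result)
```

Because the description group is lazy, each match ends at the nearest </td>, so consecutive entries are converted one by one instead of being swallowed as a single match.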

Related

Regular Expression to match text between HTML tags

I have to parse a table in HTML code using Regex. However, the code for these tables can differ because they come from different sources. But, they all have in common that they only use 2 columns. So what I want to do is to match all text that is not encapsulated within the '<' and '>' symbols.
Moreover, I want to name the two columns / groups in Regex.
I have this table row for example:
<tr>
<td width="313" valign="top" style="width:234.9pt;border:solid windowtext 1.0pt;padding:0cm 5.4pt 0cm 5.4pt">
<p class="MsoNormal"><o:p>Company</o:p></p>
</td>
<td width="313" valign="top" style="width:234.9pt;border:solid windowtext 1.0pt;border-left:none;padding:0cm 5.4pt 0cm 5.4pt">
<p class="MsoNormal">TestCompany<o:p></o:p></p>
</td>
</tr>
For which I only want to select 'Company' and 'TestCompany' and name these matches as 'Key' and 'Value' respectively.
I came up with the following Regex:
https://regex101.com/r/09fpbz/2
However, this also selects odd tags like </o:p> and spaces/new lines.
If the <o:p> tags always appear as in your example, this will work:
<tr>(?:[^<]*<){3}[^>]+>(?<Key>[\w\s]+)(?:[^>]*>){5}(?<Value>[\w\s]+)
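As a rough check of this approach, here is a sketch in Python. Two things differ from the pattern above and are assumptions of the sketch: Python spells named groups (?P<name>...), and both groups use [\w\s]+ so that values containing spaces are captured too. The style attributes are shortened, and the names ROW and pair are illustrative only.

```python
import re

# The row from the question, with the style attributes shortened.
row = """<tr>
<td width="313" valign="top" style="width:234.9pt">
<p class="MsoNormal"><o:p>Company</o:p></p>
</td>
<td width="313" valign="top" style="width:234.9pt;border-left:none">
<p class="MsoNormal">TestCompany<o:p></o:p></p>
</td>
</tr>"""

ROW = re.compile(
    r"<tr>(?:[^<]*<){3}[^>]+>(?P<Key>[\w\s]+)"  # skip 3 tag openers, then the key
    r"(?:[^>]*>){5}(?P<Value>[\w\s]+)"          # skip 5 tag closers, then the value
)

m = ROW.search(row)
# strip() removes any surrounding whitespace that [\w\s]+ picked up.
pair = (m.group("Key").strip(), m.group("Value").strip()) if m else None
print(pair)
```

The pattern counts tags, so it only holds while every source emits the same number of tags before each value; a row with an extra wrapper element would shift the groups.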

Collect results from groups to one string

I want to parse weather html page for Openhab.
This is significant part of whole html:
<!-- Amount of Sun -->
<tr>
<td class="label_det">
<span class="sum">∑</span> <span class="unit">in u</span>
</td>
<td class="sunamount">
10.2
</td>
<td class="sunamount">
10.6
</td>
<td class="sunamount">
5.9
</td>
<td class="sunamount">
6.8
</td>
<td class="dgrey sunamount">
6.8
</td>
<td class="dgrey sunamount">
5.4
</td>
<td class="sunamount">
5
</td>
</tr>
I would like to collect all the numbers into one string. I understand that this may not be possible, but maybe...
So something like this: '10.2 10.6 5.9 6.8 6.8 5.4 5'
Example of full html and my current regex is here: https://regex101.com/r/nrzPHU/1
Thanks in advance.
You need named capture groups. A named capture group lets you label a part of the regex so you can extract it later by name. It starts with (?<name>, is followed by the sub-pattern, and ends with ).
<td class\=\".*?sunamount\">\s+(?<amount>\d+(\.\d+)?)\s+<\/td>
You would then be able to extract the amount by applying your regex to the input and picking the group named amount out of it.
Reading about OpenHab online I'm not sure they support named capture groups. So an alternative would be using the regex above to match all lines with amounts in the input. Then using a regex replace on that matched string. So something like...
Use this regex to get amounts:
<td class\=\".*?sunamount\">\s+\d+(\.\d+)?\s+<\/td>
Use this regex on the result of the regex above to replace the non amounts (and replace them with an empty string to delete them):
([\s]|<td class=".*?">|<\/td>)
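Outside OpenHab, the "collect everything into one string" step is usually done by the host language rather than by the regex itself. A minimal sketch in Python, assuming a shortened copy of the table from the question; AMOUNT and amounts are names chosen here, and whether OpenHab offers an equivalent is not something this sketch claims.

```python
import re

# Shortened copy of the table row from the question.
html = """<tr>
<td class="label_det">
<span class="sum">∑</span> <span class="unit">in u</span>
</td>
<td class="sunamount">
10.2
</td>
<td class="dgrey sunamount">
6.8
</td>
<td class="sunamount">
5
</td>
</tr>"""

# One capture group, so findall() returns just the captured numbers.
AMOUNT = re.compile(r'<td class="[^"]*sunamount">\s*(\d+(?:\.\d+)?)\s*</td>')

amounts = " ".join(AMOUNT.findall(html))
print(amounts)
```

Using [^"]* instead of .*? keeps the class match inside a single attribute value, and the inner (?:...) is non-capturing so that only the whole number is returned.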

Find a table's last cell by regular expression

I want to use a regular expression (compatible with PCRE) to select a table
cell in an XML or HTML file. This cell spans several lines and contains
other elements with their attributes and values. The cell is supposed to be in the last column.
For various reasons I can't and don't want to use the ". matches newline" option.
for example in this code:
EDITED:
<table colcount="4">
<tr>
<td colspan="2">
<para><text> Mike</text></para>
</td>
<td>
<tab />
</td>
<td1>
<para><text>Jack</text></para>
<para><text>Sarah</text></para>
</td>
</tr1>
<tr>
<td>
<para><text>Bob</text></para>
<para><text>Rita</text></para>
</td>
<td2 colspan="3" with>
<para><text>Helen</text></para>
</td>
</tr2>
<tr>
<td style="with:445px;">
<para><text>Sam</text></para>
</td>
<td>
<para><text>Emma</text></para>
<para><text>George</text></para>
</td>
<td>
</td>
<td3 colspan="">
<tab />
</td>
</tr3>
</table>
/EDITED
I want to find and select the whole last cell, including its start and end tags (<td and </td>)
and the end tag of the corresponding row (</tr>), that is:
EDITED:
Here is what I want to select in the table like above using RegEx:
Either from <td1 to </tr1> - or from <td2 to </tr2> - or from <td3 to </tr3>
/EDITED
The format (indentation and new lines) has to be preserved. I mean I can't put, for example,
</tr> directly after the closing tag of the cell (</td>).
Indentation consists of space characters only.
Thanks for any help...
Best you can do with regex is:
<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>(?!(.|\r|\n)*<tr)
But this is kinda ugly, resource intensive and breaks when you have nested tables. A better route is indeed to use an XML or HTML parser for whichever programming language you're using.
If you want to select the last cell from EVERY row, as your updated question suggests, leave out the negative lookahead like so:
<td(([^<]|<(?!\/td>))*)<\/td>\s*<\/tr>
Working example here: http://refiddle.com/gt2
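To see how the every-row variant behaves, here is a sketch in Python on a cleaned-up two-row table (plain <td> and </tr> tags, without the td1/tr1 markers used in the question). LAST_CELL and matches are illustrative names. Note that no "dot matches newline" flag is needed, because [^<] already matches line breaks.

```python
import re

# Simplified table: two rows, the last cell of each should be selected.
table = """<table>
<tr>
 <td>
  <para><text>Mike</text></para>
 </td>
 <td>
  <para><text>Jack</text></para>
 </td>
</tr>
<tr>
 <td>
  <para><text>Bob</text></para>
 </td>
 <td colspan="3">
  <para><text>Helen</text></para>
 </td>
</tr>
</table>"""

# <td, then anything that is not the start of </td>, then </td> directly
# followed (whitespace aside) by </tr>.
LAST_CELL = re.compile(r"<td(?:[^<]|<(?!/td>))*</td>\s*</tr>")

matches = [m.group(0) for m in LAST_CELL.finditer(table)]
for cell in matches:
    print(cell)
    print("---")
```

Each match runs from the opening <td of the last cell through </tr>, with the original indentation and line breaks intact. As the answer says, nested tables break this, so a parser remains the safer choice.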

replacing special placeholders in HTML file with Qt

Good day to everybody!
I have this sort of HTML file:
<tr>
<td>
<p>First name: </p>
</td>
<td>
<p> %first_name% </p>
</td>
</tr>
<tr>
<td>
<p>Last name: </p>
</td>
<td>
<p> %last_name% </p>
</td>
</tr>
I'm looking for a way to replace special markers of the form %smth% with concrete data. The project is being developed with Qt, so I wonder whether Qt has methods that can do this.
Thanks!
The simplest solution might be using QString & QString::replace ( const QString & before, const QString & after, Qt::CaseSensitivity cs = Qt::CaseSensitive ) which replaces every occurrence of the string before with the string after and returns a reference to this string.
Place the contents of your html file into a QString then call QString::replace() to replace the special markers by concrete data. For example:
QString firstName("John");
html.replace("%first_name%", firstName);
Since regexps cannot reliably handle HTML, I recommend using
XSLT, which is supported by the xmlpatterns library.
EDIT
Since some people think HTML can still be processed with regexps in this case, here are some examples where regexps fail:
You have a marker inside an attribute (and you don't want it to be replaced):
<p class="%first name">
Someone decides to inject markup through a value:
map: %firstname -> <script language ="javascript">....</script>
With XSLT, the substituted value is escaped automatically.
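The injection concern can also be handled with plain string substitution, as long as each value is escaped before it is inserted. The following is a sketch in Python, not Qt code: the fill function, the template, and the sample values are all made up for illustration. In Qt 5 and later, QString::toHtmlEscaped() plays the role that html.escape plays here.

```python
import html
import re

template = "<p> %first_name% </p>\n<p> %last_name% </p>"
values = {"first_name": "John", "last_name": "<script>alert(1)</script>"}

def fill(template: str, values: dict) -> str:
    """Replace %name% markers with HTML-escaped values."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            return match.group(0)        # leave unknown markers untouched
        return html.escape(values[name])  # neutralise <, >, & and quotes
    return re.sub(r"%(\w+)%", repl, template)

filled = fill(template, values)
print(filled)
```

This covers the second objection (injected markup) but not the first: a marker sitting inside an attribute would still be replaced, which is where a structure-aware approach such as XSLT has the advantage.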

How to use a regular expression to extract a substring?

I'm trying to figure out regular expressions in Flex, but can't for the life of me figure out how to do the following.
From the sample below, I need to extract only "Mike_Mercury".
So I have to somehow strip out everything around it with RegExp, or whatever's best. I also need it to work with other samples. I'm getting this from the reddit API, so I'd have to extract that same section from a whole bunch of these. Thanks!
<table>
<tr>
<td>
<a href="http://www.reddit.com/r/atheism/comments/q2sfe/barack_obamas_insightful_words_on_abortion/">
<img src="http://d.thumbs.redditmedia.com/9StfiHi7hEbf8v73.jpg" alt="Barack Obama's insightful words on abortion"
title="Barack Obama's insightful words on abortion" /></a>
</td>
<td>
submitted by Mike_Mercury
to atheism
<br />
[link] <a href="http://www.reddit.com/r/atheism/comments/q2sfe/barack_obamas_insightful_words_on_abortion/">
[1722 comments]</a>
</td>
</tr>
</table>
Try this regex, which captures the run of non-whitespace characters after "submitted by ":
submitted by (\S+)
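One thing to watch for with this kind of pattern: a lazy group such as (.*?) placed at the very end of a regex is satisfied by the empty string, so it needs something after it, or a bounded class such as \S+, to capture the name. The sketch below uses Python rather than Flex to compare the two on a trimmed copy of the sample; the variable names are illustrative.

```python
import re

# Trimmed copy of the second table cell from the question.
snippet = """<td>
submitted by Mike_Mercury
to atheism
<br />
</td>"""

# Lazy group with nothing after it: matches as little as possible.
lazy = re.search(r"submitted by (.*?)", snippet)

# Bounded group: one or more non-whitespace characters.
bounded = re.search(r"submitted by (\S+)", snippet)

print(repr(lazy.group(1)))
print(repr(bounded.group(1)))
```

The \S+ version assumes user names never contain whitespace, which holds for the sample shown.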