I am trying to work out how to use RegExMatch to capture a string that sits between two pieces of text. I came up with this working solution:
RegExMatch(clipboard,"STUFF<p class=""figure"">(.*)</p></div><",match)
which gave me 2140 when run against
<div><div>STUFF<p class="figure">2140</p></div></div>
But that is all on a single line. I have no idea how to apply this to markup that spans multiple lines, such as
<tr>
<td>Qty</td>
<td>:</td>
<td>
310
</td>
</tr>
</table>
I would like to get 310. What should my regex be? I couldn't figure it out; I tried \s\s, but to no avail. Please help.
EDIT
I seem to be getting the hang of \s* now. I built the pattern up slowly to see where it stops matching. With
<td>Qty</td>\s*<td>(.*?)</td>
it gave me :
but I couldn't get it past the <td>:</td> part; it always returns blank. I am wondering if my (.*?) should be (.\s?) instead.
I tried
RegExMatch(clipboard,"<td>Qty</td>\s*<td>(.\s*?)</td>\s*</tr>",match)
but to no avail
The basic idea is to add \s* wherever whitespace might appear. It matches zero or more whitespace characters (spaces, tabs, newlines, etc.). Because . does not match a newline by default, also put \s* on both sides of the capture group. For example,
\s*<tr>\s*<td>Qty</td>\s*<td>:</td>\s*<td>\s*(.*?)\s*</td>\s*</tr>
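The idea can be checked outside AutoHotkey with a small Python sketch. It assumes Python's re module instead of RegExMatch, and it places \s* on both sides of the capture group as well, because . does not match a newline by default. The variable names are only illustrative.

```python
import re

# Multi-line sample taken from the question.
html = """<tr>
<td>Qty</td>
<td>:</td>
<td>
310
</td>
</tr>
</table>"""

# \s* absorbs the line breaks between the tags and around the value.
pattern = re.compile(
    r"<td>Qty</td>\s*<td>:</td>\s*<td>\s*(.*?)\s*</td>\s*</tr>"
)

match = pattern.search(html)
qty = match.group(1) if match else None
print(qty)
```

The lazy group (.*?) stays on one line, and the surrounding \s* takes care of the newlines before and after the number.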
Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to parse HTML with regex, but I think this is a special case that should work, were it not for the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output, every note line sits between <div> - </div> elements, and "header/title lines" are additionally wrapped in <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split into several unnecessary HTML header elements, as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote with the same result as in Apple Notes: each "header/title" line ends up split across multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to clean the example HTML output above by removing the unnecessary <h1> or <h2> tags (all but the first on every line) and </h1> or </h2> tags (all but the last on every line), to get the following result, which can be imported into OneNote without problems.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to Regex, especially advanced Regex, but I think I have found a way to clean the erroneous HTML code using TWO different Regex expressions as follows.
Both work well when tested on regex101.com, I think.
The first one is used to remove unnecessary </h1> or </h2> elements and uses a positive lookahead (it works both in regex101 and in Notepad++)
(</h[1-6]>)(?=.*?\1)
(Demo)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
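For readers who want to reproduce the first expression outside Notepad++, here is a minimal sketch using Python's re module (an assumption; Notepad++ uses a different regex engine, so this only illustrates the pattern itself). The variable names are illustrative.

```python
import re

# The two simplified example lines from above.
lines = [
    "<div><h1>TEST </h1><h1>Title<br></h1></div>",
    "<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2>"
    "<h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>",
]

# A closing header tag is matched only if the same closing tag
# appears again later on the line (positive lookahead with \1).
closing_dupes = re.compile(r"(</h[1-6]>)(?=.*?\1)")

# Replace every such match with nothing, line by line.
cleaned = [closing_dupes.sub("", line) for line in lines]

for line in cleaned:
    print(line)
```

After the substitution, each line keeps only its last closing header tag; the opening tags are untouched at this stage.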
The second one is used to remove unnecessary <h1> or <h2> elements and uses a positive lookbehind (it works in regex101 but NOT fully in Notepad++)
(?<=(<(h[1-6])>))(?:.*?)\K\1
(Demo)
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group preserves the first <h#> on each line; for later matches on the same line it matches nothing. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) lets the pattern match, and so delete, both opening and closing tags. The last part is a positive lookahead that makes sure the last </h#> is not deleted.
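The single-pass replacement can be tried out with a short Python sketch. This assumes Python's re module with re.MULTILINE rather than Notepad++, so it only demonstrates the pattern and the \1\2 replacement, not Notepad++'s Replace All behavior.

```python
import re

# The two example lines from the question, as one multi-line text.
text = (
    "<div><h1>TEST </h1><h1>Title<br></h1></div>\n"
    "<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2>"
    "<h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>"
)

# Group 1 (optional): line start up to and including the first <h#>.
# Group 2: whatever precedes the next header tag.
# The header tag itself is matched but not captured, so it is dropped.
# The lookahead requires another closing header tag later on the line,
# which protects the last </h#>.
pattern = re.compile(
    r"(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)",
    re.MULTILINE,
)

# In Python 3.5+ an unmatched group is replaced by an empty string.
cleaned = pattern.sub(r"\1\2", text)
print(cleaned)
```

On these two lines the result is the target HTML given in the question.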
In both lines of your example, all the header levels on a line are the same, so this regex is fine. If the first opening tag and the last closing tag don't match, you have a problem, but I think your solutions have the same problem. In any case, you can fix that in a second pass with ^(.*<h)([1-6])(.*</h)[1-6], replaced with \1\2\3\2.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
In any event here is a REGEX101 screenprint with this regex working on your examples:
I have the regex below to pull text out of an HTML tag, but it doesn't yield the expected result.
HTML Tag:
<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>
Find = re.findall(?<=Issue Amount</td> <td>:</td> <td>) [0-9,]),soup_string)[0]
I need to get the numerical value 20,000,000.00 from this tag.
Any advice on what I am doing wrong here? I tried a couple of other ways, but with no success.
Do not under any circumstances try to parse XML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it.
However, in your case you have broken your regex by looking for a space between your </td> and <td> tags, whereas your data has line breaks there. You can use the \s metacharacter to match any whitespace character.
Below is the regex that got me the desired output. Thanks all for your input.
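As a minimal sketch of the \s suggestion, assuming the same soup_string content as in the question: Python's re module only allows fixed-width lookbehinds, so \s* cannot go inside (?<=...). A capture group is used here instead.

```python
import re

# Sample markup from the question, with line breaks between the cells.
soup_string = """<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>"""

# \s* matches the line breaks (or any other whitespace) between tags.
# The amount is captured by the group rather than by a lookbehind.
pattern = re.compile(r"Issue Amount</td>\s*<td>:</td>\s*<td>([\d,.]+)")

amounts = pattern.findall(soup_string)
amount = amounts[0] if amounts else None
print(amount)
```

With a single capture group, findall returns just the captured amounts, so no slicing of the full match is needed.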
(?<=Issue Amount[td\W]{21})([\d,.]+)
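A short sketch of how this fixed-width lookbehind behaves, assuming Python's re module. The {21} counts exactly the characters of </td>, <td>:</td> and <td> plus two single-character line breaks, so the illustrative second input with \r\n line endings no longer matches.

```python
import re

pattern = re.compile(r"(?<=Issue Amount[td\W]{21})([\d,.]+)")

# Same markup as in the question, with "\n" line breaks.
unix_text = "<td>Issue Amount</td>\n<td>:</td>\n<td>20,000,000.00</td>"

# Hypothetical variant of the same markup with Windows line endings.
windows_text = unix_text.replace("\n", "\r\n")

unix_match = pattern.search(unix_text)
windows_match = pattern.search(windows_text)

print(unix_match.group(1) if unix_match else None)
print(windows_match.group(1) if windows_match else None)
```

The pattern works for the exact layout shown but depends on the spacing staying the same; a \s*-based pattern with a capture group is less sensitive to that.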
I have some html page, it looks like:
<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>
I need to take the text from square brackets, but only inside the commented region. I have 2 regexes, and each works fine, but only separately.
/[^[\]]+(?=])/g - this is for text in brackets;
(?=<!--\[content)([\s\S]*?content]-->) - for commented fields.
But I can't combine them. I tried (?=<!--\[content)([^[\]]+(?=]))([\s\S]*?content]-->) but it doesn't work. I don't know much about regex; how can I combine them?
UPD: for output I need only the text in brackets between the comment markers (this, oops, world).
First, I would start with a simple one:
(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)
Explanation
(?<=\[)[^\]\[]*(?=\]) matches the text inside any pair of square brackets,
(?=[\s\S]*?<!--content\]-->) looks ahead for any text followed by a closing content tag.
That sounds like it makes sense, right? But check DEMO1: it doesn't work. So the question is, why?
The problem lies in the lookahead assertion. In the explanation above I described it as:
(?=[\s\S]*?<!--content\]-->) looks ahead for any text followed by a closing content tag.
This is WRONG. It should be:
(?=[\s\S]*?<!--content\]-->) looks ahead for any text, including opening content tags, followed by a closing content tag.
So the root of the issue is that [\s\S]*? can run across an opening content tag, which means bracketed text before the commented block also matches.
Workaround
To prevent this, we can pair every character consumed by [\s\S]*? with a negative lookahead for the opening content tag. Thus, we get:
(?<=\[)[^\]\[]*(?=\])(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)
Notice that
[\s\S]*?
is just modified to
(?:(?!<!--\[content-->)[\s\S])*?
which means (?!<!--\[content-->) is checked in front of every character matched by [\s\S]. For example, if [\s\S]*? matches ABCDEF..., the negative lookahead is applied like this:
(?!<!--\[content-->)A(?!<!--\[content-->)B(?!<!--\[content-->)C(?!<!--\[content-->)D(?!<!--\[content-->)E(?!<!--\[content-->)F...
Finally, please check DEMO2. See? It just works!
DISCLAIMER: My regex works only for simple examples like the one in the question. For more complex input, such as recursive structures, I can't guarantee it.
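To compare the two patterns side by side, here is a minimal sketch assuming Python's re module (the question itself looks like JavaScript, so this only illustrates the regex logic). The sample is the HTML from the question; the names naive and guarded are illustrative.

```python
import re

html = """<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>"""

# First attempt: the lookahead may cross the opening content tag.
naive = re.compile(
    r"(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)"
)

# Workaround: no character consumed inside the lookahead
# may be the start of an opening content tag.
guarded = re.compile(
    r"(?<=\[)[^\]\[]*(?=\])"
    r"(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)"
)

naive_hits = naive.findall(html)
guarded_hits = guarded.findall(html)
print(naive_hits)
print(guarded_hits)
```

The first pattern also picks up the bracketed text in the table header, because a closing content tag still follows it; the second pattern keeps only the three values inside the commented block.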
I'm trying to assign a 6-digit sequence which lies in a <pre> node to a variable using the "store" command with XPath and regex, but something is wrong with my approach.
Sample text from <pre>:
"OPERACIA, KOD PODTVERZDENIA 021477"
Command:
store(//table[@id='sms_table']/tbody/tr/td/pre[matches(text(),'[0-9]{6}')], foo)
First thing to note: you should be using storeText, not store. store only records whatever you put in the target field; it won't look up the locator on the page. Also, the way you've used your regex won't give you what you need: [0-9]{6} does match six consecutive digits, but matches() only tests whether the node's text contains such a sequence, so the XPath still selects the whole <pre> node rather than the digits.
I recently had to do pretty much the same thing. I split it into two commands rather than trying to do it all in one go: the first command stores the full text, and the second uses a regex to pull out the 6 digits, like below.
<tr>
<td>storeText</td>
<td>//table[@id='sms_table']/tbody/tr/td/pre</td>
<td>Text</td>
</tr>
<tr>
<td>storeEval</td>
<td>storedVars['Text'].match(/\d{6}/)</td>
<td>digits</td>
</tr>
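The same two-step idea (store the whole text first, then extract the digits) can be sketched in Python. This is only an illustration of the regex step, not Selenium IDE code; the string is the sample from the question and the variable names are made up.

```python
import re

# Step 1: the full text of the <pre> node, as storeText would capture it.
text = "OPERACIA, KOD PODTVERZDENIA 021477"

# Step 2: find the first run of six consecutive digits in that text.
found = re.search(r"\d{6}", text)
digits = found.group(0) if found else None
print(digits)
```

Keeping the two steps separate means the locator only has to find the node, and the regex only has to deal with plain text.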
I am trying to pull some info. Here is my regex:
<tr>
<td>([^<]+)<i><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/i><sup id="([^<]+)" class="([^<]+)"><a href="([^<]+)"><span>[<\/span>1<span>]<\/span><\/a><\/sup><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td>([^<]+)<\/td>
<td>([^<]+)<\/td>
</tr>
here is sample html
<tr>
<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference"><span>[</span>18<span>]</span></sup></td>
<td>989 Studios</td>
<td>989 Studios</td>
<td>1999-03-31<sup>NA</sup></td>
<td>NA</td>
</tr>
As of now I just want to get the data to find matches. Can you see any reason why it would not match this?
For all the haters:
I don't care about your opinions on whether I should use regex on HTML or not. For this case it will work great. I have one page, and the data I need is in a table. Once I get the data I will save it to my DB and never have to use the regex again. So if your comment or answer is about your opinion on using regex with HTML, don't post.
...Second line:
<td>([^<]+)<i>
cannot hope to match:
<td><i>
because you used '+' (equivalent to '{1,}'), which requires at least one character, while there is nothing between your tags. I didn't check the rest of your regex, but as it stands it can't work.
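A minimal sketch of the difference, assuming Python's re module and using the start of the first cell from the sample HTML. The * variant is shown only for contrast; it is not the pattern from the question.

```python
import re

# Start of the first cell in the sample HTML: nothing between <td> and <i>.
cell = "<td><i>3Xtreme</i>"

# '+' needs at least one non-'<' character between the tags.
plus_pattern = re.compile(r"<td>([^<]+)<i>")

# '*' also accepts zero characters.
star_pattern = re.compile(r"<td>([^<]*)<i>")

plus_match = plus_pattern.search(cell)
star_match = star_pattern.search(cell)

print(plus_match)
print(repr(star_match.group(1)) if star_match else None)
```

The first search finds nothing, while the second succeeds with an empty capture.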
Edit:
Please also correct the "([^<]+)" and so on (I hope you see why)... And edit your regex when you correct it.
Edit 2:
Seeing as it's quite a disaster (sorry but it's the truth :/): please consider replacing all your ([^<]+) things that won't work for all your cases by a simple (.*?)
Edit 3:
[ and ] must be escaped. (\d will help you catch numbers)
<span>[<\/span>1<span>]<\/span>
Lots of problems here: you must escape the brackets and obviously 1 won't match 18
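Both points can be seen in a small sketch, assuming Python's re module and the <sup> element from the sample HTML. The escaped pattern with (\d+) is one possible correction, not necessarily the only one.

```python
import re

sup = (
    '<sup id="cite_ref-18" class="reference">'
    "<span>[</span>18<span>]</span></sup>"
)

# Unescaped: everything between [ and ] is read as one character class,
# so the literal brackets in the HTML are never matched.
unescaped = re.compile(r"<span>[<\/span>1<span>]<\/span>")

# Escaped brackets, and \d+ instead of a hard-coded 1.
escaped = re.compile(r"<span>\[</span>(\d+)<span>\]</span>")

unescaped_match = unescaped.search(sup)
escaped_match = escaped.search(sup)

print(unescaped_match)
print(escaped_match.group(1) if escaped_match else None)
```

Using \d+ rather than a literal number lets the same pattern work for any citation number in the table.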