SLRE regex doesn't work properly - c++

I have a problem with the SLRE library: I can't figure out how to stop it from grabbing everything after my match. Let's say I have HTML output, and somewhere in the middle of the buffer there is a line I want to parse:
name="id" value="1a2b3c4d5e6f" />
Here is my regular expression
slre_compile(&test, "name=\"id\" value=\"(.*?)\" />")
I have read about greedy and non-greedy quantifiers in other threads where people had a similar problem, but in my case adding ? to the expression doesn't change anything.
SLRE returns a match that starts at 1a2b3c4d5e6f" /> and continues through the rest of the HTML page up to the </html> tag, and I don't know why. It cuts off the beginning of the HTML source but keeps everything after my expression. I have also tried the following regex:
slre_compile(&test, "^.*?name=\"id\" value=\"(.*?)\" />.*?$")
and some others, modified with greedy and non-greedy quantifiers, which gave me the same results. Does anyone know why SLRE can't stop at " /> and continues capturing characters until the source string ends?

It seems that SLRE does not understand non-greedy quantifiers and instead parses .*? as if it were (?:.*)?. However, in this case \"[^\"]*\" should work.
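The difference between the two patterns can be sketched outside SLRE. The example below is a minimal illustration in Python, not SLRE itself, and the sample HTML string is made up. A greedy .* runs to the last " /> in the buffer, while the negated class [^"]* has to stop at the first closing quote, so it does not depend on non-greedy support in the engine.

```python
import re

# Made-up sample buffer: the line of interest plus trailing HTML.
html = '<input name="id" value="1a2b3c4d5e6f" />\n<input name="x" value="y" />\n</html>'

# Greedy .* (with DOTALL so it may cross lines) backtracks only as far
# as the LAST '" />' in the buffer, so it over-captures.
greedy = re.search(r'name="id" value="(.*)" />', html, re.DOTALL)

# [^"]* cannot cross a double quote, so the capture ends at the first one.
negated = re.search(r'name="id" value="([^"]*)" />', html)

print(repr(greedy.group(1)))
print(repr(negated.group(1)))
```

Because the negated class never needs a non-greedy quantifier, the same idea should carry over to engines that only implement greedy matching.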

Related

Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to use regex to parse HTML, but I think this is a special case that should work, apart from the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote with the same appearance as in Apple Notes: each "header/title" line is split into multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to strip the unnecessary <h1> or <h2> tags (all but the first on every line) and </h1> or </h2> tags (all but the last on every line) from the example HTML above, to get the following result, which can be imported into OneNote without problems.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to regex, especially advanced regex, but I think I have found a way to clean the erroneous HTML code using two different regular expressions, as follows.
Both work well when tested on regex101.com, I think.
The first one is used to remove unnecessary </h1> or </h2> elements and uses a positive lookahead (it works both in regex101 and in Notepad++)
(</h[1-6]>)(?=.*?\1)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The second one is used to remove unnecessary <h1> or <h2> elements and uses a positive lookbehind (it works in regex101 but NOT fully in Notepad++)
(?<=(<(h[1-6])>))(?:.*?)\K\1
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group skips the first <h#> on each line and is null after line start. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) scans and deletes both open and close tags. The last part is a positive lookahead to make sure the last </h#> is not deleted.
In the two lines of your example, all the header levels on each line are the same, so this regex is fine. If the first opening tag and the last closing tag don't match, you have a problem, but I think your solutions have the same problem. In any case, you can fix that in a second pass with ^(.*<h)([1-6])(.*</h)[1-6] and replace it with \1\2\3\2.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
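The one-pass replacement can be checked outside Notepad++. The sketch below is a minimal illustration in Python rather than Notepad++'s engine; the function name merge_headers is hypothetical, and re.MULTILINE is assumed so that ^ and $ work per line. It applies the pattern and the \1\2 replacement to the two sample lines from the question.

```python
import re

# The one-pass pattern from the answer; MULTILINE makes ^ and $ apply per line.
PATTERN = re.compile(
    r'(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)',
    re.MULTILINE,
)

def merge_headers(html):
    # Group 1 only participates at the start of a line; when it does not,
    # Python (3.5+) substitutes an empty string for \1.
    return PATTERN.sub(r'\1\2', html)

sample = (
    '<div><h1>TEST </h1><h1>Title<br></h1></div>\n'
    '<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2>'
    '<h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>'
)

cleaned = merge_headers(sample)
print(cleaned)
```

On these two lines the output keeps only the first opening and the last closing header tag per line, which is the result the question asks for.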

Why does a regular expression find a match outside of its bounds?

I have the following regular expression, which I'm using to find <icon use="some-id" class="some-class" />:
(?:<icon )(?=(?:.*?(?:use=(?:"|')(.*?)(?:"|')))?)(?=(?:.*?(?:class=(?:"|')(.*?)(?:"|')))?)(?:.*?)(?: \/)?[^?](?:>)
This mostly works, except that if I don't specify a class on the icon, but do specify one on another element on the same line, it'll match that other element's class, even though the full match is reported as just being the icon element.
For example:
<icon use="search" /> <div class="test"></div>
$1 for that is search, and $2 is test, even though they're not part of the same element. $& is reporting <icon use="search" />.
I'm sure I'm missing something obvious about the way regular expressions work.
The .*? just before the match of class= will match ANYTHING it has to in order to make the rest of the regex match - including the end of the first tag and the start of the second one, and everything that might lie in between. The only restriction you've placed on it is that it can't cross a line boundary, as newlines are not matched by . by default. To make this work somewhat more reliably, you'd need to restrict that part of the regex so that it cannot cross a tag boundary: [^<]+? (one or more characters that aren't a left angle bracket, matching as few as possible) should do the job.
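The effect can be reproduced with a simplified version of the pattern. The sketch below is a minimal illustration in Python; both patterns are shortened stand-ins for the original regex, and the strict one uses [^<>]*? (zero or more, also excluding >), which is a variation on the [^<]+? suggested above rather than the exact fix.

```python
import re

line = '<icon use="search" /> <div class="test"></div>'

# Loose: .*? inside the lookaheads may run past the end of the <icon> tag.
loose = re.compile(
    r'<icon (?=(?:.*?use=["\'](.*?)["\'])?)'
    r'(?=(?:.*?class=["\'](.*?)["\'])?)[^>]*>'
)

# Strict: [^<>]*? cannot cross a tag boundary, so the lookaheads stay
# inside the <icon> element.
strict = re.compile(
    r'<icon (?=(?:[^<>]*?use=["\']([^"\']*)["\'])?)'
    r'(?=(?:[^<>]*?class=["\']([^"\']*)["\'])?)[^>]*>'
)

loose_match = loose.search(line)
strict_match = strict.search(line)

print(loose_match.group(0), loose_match.groups())
print(strict_match.group(0), strict_match.groups())
```

Both patterns report the same full match, but only the loose one picks up the class of the neighbouring div, because a lookahead is free to inspect text beyond what the overall match consumes.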

Regular expression is correct, but doesn't work in Notepad++

I would like to drop a table cell from all of our XSL templates.
The code is the following:
<td width="100"><img src="/logos/code.png" border="0" width="100"/></td>
The code.png is different in every file. My regex is the following:
\<td.*\>\<img.*\/logos\/.*png.*\/\>\<\/td\>
I tested the expression on https://regex101.com/ and it matches the above string, but when I try to find & replace with Notepad++, it gives me no match.
My xsl is all in one line, so line break cannot be the problem. Can someone help me, and give me a pattern that works in NP++?
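You must not escape < and >. In Notepad++'s regex engine (Boost), \< and \> are word-boundary anchors (start and end of a word), not literal angle brackets.
Here is your regex: <td.*?><img.*?\/logos\/.*?png.*?\/><\/td>.
I also added ? to your .* so that they are not greedy.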
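One caveat when the whole file is on a single line: a non-greedy .*? still lets the match start at an earlier <td and stretch forward. The sketch below is a minimal illustration in Python, not Notepad++; the surrounding table cells are made up, and the [^>]* variant is an alternative not given in the answer. It compares the lazy pattern with one that cannot leave the current tag.

```python
import re

# Made-up single-line row: the logo cell sits between two other cells.
xsl = ('<tr><td>keep</td>'
       '<td width="100"><img src="/logos/code.png" border="0" width="100"/></td>'
       '<td>also keep</td></tr>')

# Lazy version from the answer (the \/ escapes are not needed in Python).
lazy = re.compile(r'<td.*?><img.*?/logos/.*?png.*?/></td>')

# Variant: [^>]* cannot run past the end of the tag it started in.
strict = re.compile(r'<td[^>]*><img[^>]*/logos/[^>]*png[^>]*/></td>')

lazy_result = lazy.sub('', xsl)
strict_result = strict.sub('', xsl)

print(lazy_result)
print(strict_result)
```

On this input the lazy pattern also removes the first cell, because the leftmost possible match starts at the first <td, while the negated-class variant removes only the logo cell.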

Regular expression to remove repeated slashes that are after a specific word (VBScript/Classic ASP)

I'm struggling here, trying to figure out how to replace all double slashes that come after a specific word.
Example:
<img alt="" src="/pt/webf//2015//47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>
I want the string above to look like this:
<img alt="" src="/pt/webf/2015/47384_1.JPG" height="235" width="378" />
<div>Don't remove this // or this//</div>
Notice the double slashes have been replaced with just one slash in the img tag but left unscathed in the div tag. I only want to replace the double slashes IF they come after the word: pt.
I tried something like this:
(?=pt)((.*?)\/\/)+
However, the first thing wrong with it is (?=) does not do pattern backtracking, as far as I'm aware. That is, it'll only look for the first matching pattern. The second thing wrong with it is it doesn't work as I intended it to.
https://regex101.com/r/kC4tA5/1
Or maybe I'm going about this the wrong way, since regular expression support is not expansive in VBScript/Classic ASP and I should try to break up the string and process, instead of trying to do everything in one regular expression???
Any help would be appreciated.
Thank you.
I am interpreting your issue as "Removing repeated slashes in all <img src> attributes."
As I said in the comments, working with HTML requires a parser. HTML is too complex for regular expressions, all kinds of things can go wrong.
Luckily, there is a parser available to VBScript: The htmlfile object. It creates a standard DOM from your HTML string. So the solution becomes exactly as described:
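Function FixHtml(htmlString)
    Dim doc, img, slashes

    ' Regex that matches any run of one or more slashes
    Set slashes = New RegExp
    slashes.Pattern = "/+"
    slashes.Global = True

    ' Parse the HTML string into a DOM
    Set doc = CreateObject("htmlfile")
    doc.Write htmlString

    For Each img In doc.getElementsByTagName("IMG")
        ' Collapse each run of slashes into a single slash
        img.src = slashes.Replace(img.src, "/")
        ' Strip the "about:blank" / "about:" prefix that htmlfile prepends
        img.src = Replace(Replace(img.src, "about:blank", ""), "about:", "")
    Next

    FixHtml = doc.body.innerHTML
End Function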
Unfortunately, htmlfile is not the most advanced HTML parser in the world, but rest assured that it will still do way better than any regex.
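The slash-collapsing step on its own can be sketched as follows. This is a minimal illustration in Python rather than VBScript; the function name collapse_slashes and the example.com URL are hypothetical. It also adds one assumption that is not in the function above: a negative lookbehind so that the // after a scheme such as http: is left alone, which a plain /+ pattern would collapse as well.

```python
import re

def collapse_slashes(src):
    # Replace every run of two or more slashes with one slash, except a
    # run that directly follows a colon (as in "http://").
    return re.sub(r'(?<!:)/{2,}', '/', src)

fixed_path = collapse_slashes('/pt/webf//2015//47384_1.JPG')
fixed_url = collapse_slashes('http://example.com//a///b.png')

print(fixed_path)
print(fixed_url)
```

Because the regex only ever sees the value of one src attribute, taken from the DOM, text such as the // inside the div is never touched.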
There are two minor issues:
I found in my tests that for some reason it insists on prepending the img.src with about: or about:blank. This should not happen, but it does. The second line of Replace() calls gets rid of the unwanted additions.
The .innerHTML will produce tag names in upper case, so <img> becomes <IMG> in the output. Insignificant line breaks in the HTML source might also be removed. This is a minor annoyance; I recommend you don't obsess over it.(*)
But there are two big plus sides as well:
The DOM puts you in a position where you can work with the input in a structured way. You can put in any number of complex fixes now that would have been impossible to do with regex.
The return value of .innerHTML is sane HTML. It will fix any gross blunder in the input and turn it into something that is well-nested, well-escaped and otherwise well-behaved.
(*) If you do find yourself obsessing over it, you can use the wisdom from this blog post to create a function that replaces all uppercase tags that come out of .innerHTML with lowercase versions of themselves. This actually is something you can use regex for ("(</?[A-Z]+)", to be exact), because we know that there will be no stray < not belonging to a tag anywhere in the string, because that's .innerHTML's guarantee. While it would be a nice exercise (and it introduces you to the little-known fact that VBScript has function pointers), I would say it's not really worth it.
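For reference, the tag-lowercasing idea looks like this when sketched in Python instead of VBScript. The function name lowercase_tags is hypothetical; the pattern is the one quoted above, and a callback lowercases each matched tag opener.

```python
import re

def lowercase_tags(html):
    # (</?[A-Z]+) matches "<" or "</" followed by an upper-case tag name;
    # the callback returns the same text in lower case.
    return re.sub(r'(</?[A-Z]+)', lambda m: m.group(1).lower(), html)

lowered = lowercase_tags('<IMG src="/a/b.JPG"><DIV>Text</DIV>')
print(lowered)
```

Attribute values such as the upper-case file extension are left alone, because the pattern only matches letters that directly follow < or </.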

Using ant <propertyregex>, how can I capture the /etc/shadow record for a user?

From ant, we want to extract a line from an old /etc/shadow file, capturing the line for a specific user name, such as "manager". This is part of a backup/restore operation. What we used previously was not specific enough, so it would match users like "mymanager", so we tried to tighten it down by anchoring the start of the string to beginning of the line (typically "^"). This definitely did not work as we expected.
How can we anchor so that we get an exact match for a username? -- answered below.
First attempt, which gave the wrong result if we had a user of "mymanager" in the /etc/shadow file copy:
<loadfile property="oldPasswords" srcFile="${backup.dir}/shadow"/>
<propertyregex property="manager.backup" input="${oldPasswords}"
regexp="(manager\:.*)" select="\1" casesensitive="true" />
Second attempt, which failed because "^" is not interpreted in the normal regular expression way by default:
<loadfile property="oldPasswords" srcFile="${backup.dir}/shadow"/>
<propertyregex property="manager.backup" input="${oldPasswords}"
regexp="^(manager\:.*)" select="\1" casesensitive="true" />
Kobi suggested adding -> flags="m" <- which sounded good but ant reported that the flags option is not supported by propertyregex.
The final, successful, approach required inserting "(?m)" at the beginning of the regexp: That was the essential change.
<propertyregex property="manager.backup" input="${oldPasswords}"
regexp="(?m)^manager:.*$" select="\0" casesensitive="true" />
The regexp with propertyregex appears to follow the rules in this documentation of regular expressions in Java (search for "multiline" for example): http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Check the above document if you have similar questions about how to make propertyregex and regexp do what you want them to do!
THANKS! Solved.
Alan Carwile
I think the m(ultiline) flag is what you want to use and will give the start-of-line anchor the right behavior. It's possible to change flags within the regular expression with the syntax (?<flagstoturnon>-<flagstoturnoff>). So in your case, adding (?m) to the start of the regular expression (before the caret) should work.
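The three attempts can be compared side by side. The sketch below is a minimal illustration in Python, standing in for the Java regex engine that propertyregex appears to use; both accept the inline (?m) flag. The shadow file contents are made-up sample data.

```python
import re

# Made-up /etc/shadow contents with both "mymanager" and "manager".
shadow = (
    "root:*:17000:0:99999:7:::\n"
    "mymanager:$6$abc:17000:0:99999:7:::\n"
    "manager:$6$xyz:17000:0:99999:7:::\n"
)

# First attempt: unanchored, so it matches inside the "mymanager" line.
unanchored = re.search(r'(manager:.*)', shadow)

# Second attempt: without the multiline flag, ^ only matches at the very
# start of the input, so nothing is found.
anchored_no_flag = re.search(r'^(manager:.*)', shadow)

# Final version: (?m) makes ^ and $ match at every line boundary.
multiline = re.search(r'(?m)^manager:.*$', shadow)

print(unanchored.group(1))
print(anchored_no_flag)
print(multiline.group(0))
```

Only the (?m) version returns the record for the exact user name.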