Capturing group is too greedy - regex

I have been working hard to get a regular expression to work for me, but I'm stuck on the last part. My goal is to remove an xml element when it is contained within specific parent elements. The example xml looks like so:
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png">
<ri:page ri:content-title="Banana Farts" /> /* REMOVE THIS */
</ri:attachment>
</ac:image>
The expression I have written is:
(<ac:image.*?>)(<ri:attachment.*?)(<ri:page.*? />)(</ri:attachment></ac:image>)
In more readable format, I am searching on four groups
(<ac:image.*?>) //Find open image tag
(<ri:attachment.*?) //Find open attachment tag
(<ri:page.*? />) //Find the page tag
(</ri:attachment></ac:image>) //Find close image and attachment tags
And this basically works because I can remove the page element in notepad++ with:
\1\2\4
My issue is that the search is too greedy. In an example like below it grabs everything from start to finish, when really only the second image tag is a valid find.
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png">
<ri:page ri:content-title="Employee Portal Editor" />
</ri:attachment>
</ac:image>
Can anyone help me finish this up? I thought all I had to do was add ? to make the closing tag group not greedy, but it failed to work.

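Keep in mind that a regex engine will try every possible path to make the pattern succeed. Because your pattern uses several .*?, the engine has a lot of freedom to do so: a lazy quantifier will still expand across tags when that is the only way to reach a match. The pattern must be more restrictive.
To do that, you can replace every .*? with [^>]*, which cannot run past the end of the current tag.
Don't forget to allow optional whitespace (\s*) between tags in the pattern.
Example (written in free-spacing mode, so the spaces and # comments are not part of the pattern):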
(<ac:image[^>]*> \s* <ri:attachment[^>]*> ) # group 1
\s* <ri:page[^>]*/> \s* # what you need to remove
(</ri:attachment> \s* </ac:image>) # group 2
replacement: $1$2
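As a rough check outside Notepad++, the same pattern can be tried in Python. This is only a sketch: it assumes Python's re module with re.VERBOSE standing in for free-spacing mode, and it reuses the sample XML from the question.

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout whitespace and # comments.
PATTERN = re.compile(
    r"""
    (<ac:image[^>]*> \s* <ri:attachment[^>]*>)   # group 1: opening tags
    \s* <ri:page[^>]*/> \s*                      # the element to remove
    (</ri:attachment> \s* </ac:image>)           # group 2: closing tags
    """,
    re.VERBOSE,
)

xml = """<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png">
<ri:page ri:content-title="Employee Portal Editor" />
</ri:attachment>
</ac:image>"""

# Keep both groups and drop what sits between them (Python writes $1$2 as \1\2).
cleaned = PATTERN.sub(r"\1\2", xml)
print(cleaned)
```

Because [^>]* cannot cross a >, the first image block (with the self-closing attachment) is left alone and only the second one loses its ri:page element.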

Related

Notepad++ html tag / string (a href) replace

I found another post that uses the following regex <a[^>]*>([^<]+)</a> it works great however I want to use a capture group to target URLs that have the following 4 letters in them RTRD.
I used <a[^>]*>(RTRD+)</a> and that did not work.
TESTER I want to remove the URL and leave TESTER
LEAVE I want to not touch this one.
One that will work: <a\s[^>]*href\=[\"][^\"]*(RTRD)[^\"]*[\"][^>]*>([^<]+)<\/a>
Decomposition:
<a\s[^>]* find opening a tag with space followed by some arguments
href\=[\"][^\"]* find href attribute with " opening and then multiple non " closing
(RTRD) Your Key group
[^\"]*[\"] Find remainder of argument and closing "
[^>]*>([^<]+)<\/a> The remainder of the original regex
Things your original RegExp would match:
<a stuffhere!!.,?>RTRDDD</a>
<a>RTRD</a>
Decomposing your RegExp:
<a[^>]*> Look for opening tag with any properties
(RTRD+) Look for RTRD as the link text; the + applies only to the last D, so this matches RTR followed by one or more D
</a> Look for closing tag
Use <a[^>]*RTRD[^>]*>([^<]+)<\/a> here.
Inside the opening tag (<a[^>]*>), the pattern RTRD should appear somewhere. This can be done by replacing [^>]* with [^>]*RTRD[^>]*, which is simply:
[^>]* Anything that's not a > (the end of the tag)
RTRD The pattern RTRD
[^>]* Again, anything that's not a >
But caution: this also matches <aRTRD>test</a> or <a id="RTRD">blubb</a>
And if you have any way of doing this other than regex on HTML, use that instead (string operations, etc.).
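The two suggested patterns can be compared with a small Python sketch. The sample links and URLs below are made up for illustration (the original markup is not shown in the question), and Python's re module is assumed in place of Notepad++.

```python
import re

# Stricter pattern: RTRD must appear inside the href value.
HREF_RTRD = re.compile(r'<a\s[^>]*href="[^"]*(RTRD)[^"]*"[^>]*>([^<]+)</a>')
# Looser pattern: RTRD may appear anywhere inside the opening tag.
LOOSE_RTRD = re.compile(r'<a[^>]*RTRD[^>]*>([^<]+)</a>')

# Hypothetical sample input.
html = (
    '<a href="http://example.com/RTRD/page">TESTER</a> '
    '<a href="http://example.com/other">LEAVE</a>'
)

# Group 2 is the link text, so replacing with \2 drops the tag but keeps the text.
result = HREF_RTRD.sub(r"\2", html)
print(result)

# The caveat from the second answer: RTRD in another attribute also matches.
tricky = '<a id="RTRD" href="http://example.com/other">KEEP</a>'
loose_hit = LOOSE_RTRD.search(tricky) is not None
strict_hit = HREF_RTRD.search(tricky) is not None
print(loose_hit, strict_hit)
```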

RegEX Trying to trigger optional search

So I am using this as a string:
Username entry \([a-z]{3,15}
To search this as an example:
[PA]apf_ms.c:7678 Username entry (host/computer.domain.com) is deleted for mobile a4:c4:94:63:1c:7a
[PA]apf_ms.c:7678 Username entry (username#domain.com) is deleted for mobile 94:e9:6a:ad:14:4d
Trying to wrap my head around regex and it's driving me nuts. My search only gets me so far, I am trying to make host/ optional and can't figure out where to insert it.
Username entry \((?:host/)?[a-z]{3,15}
(?: ... ) is a non-capture group
host/ is what you want to match
? after it means optional
You can use ? to make anything optional in regex. The regex can be written as
Username entry \((?:host\/)?[a-z]{3,15}
(?:host\/)? matches zero or one host/. The ?: inside the parentheses prevents the group from capturing, as we don't need to save host/ for later use.
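A quick way to see the optional group at work is to run the pattern over both log lines from the question. This is a sketch in Python, assuming its re module; the question does not say which tool the search runs in.

```python
import re

# (?:host/)? makes the "host/" prefix optional without capturing it.
USERNAME = re.compile(r"Username entry \((?:host/)?[a-z]{3,15}")

lines = [
    "[PA]apf_ms.c:7678 Username entry (host/computer.domain.com) is deleted for mobile a4:c4:94:63:1c:7a",
    "[PA]apf_ms.c:7678 Username entry (username#domain.com) is deleted for mobile 94:e9:6a:ad:14:4d",
]

# Collect the matched text for each line.
matches = [USERNAME.search(line).group(0) for line in lines]
for text in matches:
    print(text)
```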

Google Analytics - Content grouping - Regex fix

This is our URL structure:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2
http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2
http://www.disabledgo.com/access-guide/redbridge-college/brook-centre-learning-resource-centre
I am trying to create a list of groups based on the client names
/access-guide/[this bit]/...
So I can have a performance list of all our clients.
This is my regex:
/access-guide/(.*universit(y|ies)|.*colleg(e|es))/
I want it to group anything that has university/ies or college/es in it, at any point within that client name section of the URL.
At the moment, my current regex will only return groups that are X-University:
Durham-University
Plymouth-University
Cardiff-University
etc.
What does the regex need to be to have the list I'm looking for?
Do I need to have something at the end to stop it matching things after the client name? E.g. ([^/]+$)?
Thanks for your help in advance!
Depending upon your needs you may want to do:
/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/
This will match names even if "university" or "college" is not at the end of the string, for example "college-of-the-ozarks". Note the non-capturing inner parentheses; they should probably be used no matter which solution you go with, as you don't want the group to capture just the word "university" or "college".
Additionally, I don't know what may be in your URLs, but if there may be compound words you want to exclude, using \b may be advisable. For instance, if you don't want to match "miskatonic-postcollege", you may want to do something like this:
/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/
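The difference between the two patterns can be checked with a short Python sketch. Python's re module is an assumption here (Google Analytics applies its own regex engine), and the last URL is a made-up example built from the "miskatonic-postcollege" name above.

```python
import re

LOOSE = re.compile(
    r"/access-guide/([^/]*(?:university|universities|college|colleges)[^/]*)/"
)
STRICT = re.compile(
    r"/access-guide/([^/]*\b(?:university|universities|college|colleges)\b[^/]*)/"
)

def client_name(pattern, url):
    """Return the first capture group (the client name) or None."""
    m = pattern.search(url)
    return m.group(1) if m else None

urls = [
    "http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2",
    "http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2",
    # Hypothetical URL, for the word-boundary case only.
    "http://www.disabledgo.com/access-guide/miskatonic-postcollege/main-hall",
]

loose_names = [client_name(LOOSE, u) for u in urls]
strict_names = [client_name(STRICT, u) for u in urls]
print(loose_names)
print(strict_names)
```

The hyphen counts as a non-word character, so \b still matches around "university" and "college" inside hyphenated names; only the compound word is rejected.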
If the client name section of the URL is after the access-guide/ and before the next /:
http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2
                                      |----------------------------|
you need to use a negated character class to only match university before the regex reaches that rightmost / boundary.
As per the Reference:
You can extract pages by Page URL, Page Title, or Screen Name. Identify each one with a regex capture group (Analytics uses the first capture group for each expression)
Thus, you can use
/access-guide/([^/]*(universit(y|ies)|colleges?))
               ^^^^^
The regex matches
/access-guide/ - leftmost boundary, matches /access-guide/ literally
[^/]* - any character other than / (so we still remain in that customer section)
(universit(y|ies)|colleges?) - university, universities, college, or colleges literally. Add more if needed.
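To see what the first capture group of this pattern actually holds, here is a small Python sketch (Python's re module is assumed; the helper name first_group is made up for illustration).

```python
import re

PATTERN = re.compile(r"/access-guide/([^/]*(universit(y|ies)|colleges?))")

def first_group(url):
    """Return capture group 1, the part used for grouping, or None."""
    m = PATTERN.search(url)
    return m.group(1) if m else None

urls = [
    "http://www.disabledgo.com/access-guide/the-university-of-manchester/176-waterloo-place-2",
    "http://www.disabledgo.com/access-guide/kingston-university/coombehurst-court-2",
    "http://www.disabledgo.com/access-guide/kings-college-london/franklin-wilkins-building-2",
]

groups = [first_group(u) for u in urls]
print(groups)
```

Note that group 1 stops at the keyword, so "the-university-of-manchester" is grouped as "the-university". Appending [^/]* inside the outer group would capture the whole client segment instead.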

Regex to match anything after /

I basically don't have a clue about regex, but I need a regex statement that will recognise anything after the / in a URL.
Basically, I'm developing a site for someone and a page's URL (a local URL, of course) is, say, (http://)localhost/sweettemptations/available-sweets. This page is filled with custom post types (it's a WordPress site) which have the URL of (http://)localhost/sweettemptations/sweets/sweet-name.
What I want to do is redirect the URL (http://)localhost/sweettemptations/sweets back to (http://)localhost/sweettemptations/available-sweets which is easy to do, but I also need to redirect any type of sweet back to (http://)localhost/sweettemptations/available-sweets. So say I need to redirect (http://)localhost/sweettemptations/sweets/* back to (http://)localhost/sweettemptations/available-sweets.
If anyone could help by telling me how to write a proper regex statement to match everything after sweets/ in the URL, it would be hugely appreciated.
To do what you ask, you need to use groups. In regular expressions, groups allow you to isolate parts of the whole match.
For example:
input string of: aaaaaaaabbbbcccc
regex: a*(b*)
The parentheses mark a group; in this case it will be group 1, since it is the first in the pattern.
Note: group 0 is implicit and is the complete match.
So the matches in my above case will be:
group 0: aaaaaaaabbbb
group 1: bbbb
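The same example can be reproduced in Python (an assumption; the answer does not name a regex engine), where group(0) is the whole match and group(1) is the first parenthesised group.

```python
import re

# a* consumes the leading a's; (b*) captures the run of b's as group 1.
m = re.match(r"a*(b*)", "aaaaaaaabbbbcccc")

whole = m.group(0)   # the complete match
first = m.group(1)   # only the part inside the parentheses
print(whole)
print(first)
```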
In order to achieve what you want with the sweets pattern above, you just need to put a group around the end.
possible solution: /sweets/(.*)
The more precise you are with the pattern before the group, the less likely you are to get a false positive.
If what you really want is to match anything after the last /, you can take another approach:
possible other solution: /([^/]*)$
The pattern above finds a / followed by a run of characters that are NOT another /, anchored to the end of the string by $, and keeps that run in group 1. The issue here is that you could match URLs that do not have sweets in them.
Note if you do not mind the / at the beginning then just remove the ( and ) and you do not have to worry about groups.
I like to use http://regexpal.com/ to test my regex.. It will mark in different colors the different matches.
Hope this helps.
I may have misunderstood your requirement in my original post.
If you just want to change any string that matches
(http://)localhost/sweettemptations/sweets/*
into the other one you provided (without adding the part matched by your * at the end), I would use a regular expression to match the pattern in the URL but then just blindly replace the whole string with the desired one:
(http://)localhost/sweettemptations/available-sweets
So if you want the URL:
http://localhost/sweettemptations/sweets/somethingmore.html
to turn into:
http://localhost/sweettemptations/available-sweets
and not into:
localhost/sweettemptations/available-sweets/somethingmore.html
Then the solution is simpler, no groups required :).
When doing this, I would make sure you do not match on the "localhost" part. Also, I am assuming the (http://) really means an optional http:// in front, as (http://) is not a valid protocol prefix.
So if that is what you want, then this should match the pattern:
(http://)?[^/]+/sweettemptations/sweets/.*
This regular expression will match the http:// part optionally with a host (be it localhost, an IP or the host name). You could omit the .* at the end if you want.
If that pattern matches just replace the whole URL with the one you want to redirect to.
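The match-then-replace idea can be sketched in Python as follows. The helper name redirect_target and the use of Python are assumptions for illustration only; how the redirect is actually wired into WordPress is outside this sketch.

```python
import re

# Optional http://, any host, then the /sweettemptations/sweets/ prefix.
SWEETS = re.compile(r"(http://)?[^/]+/sweettemptations/sweets/.*")
TARGET = "http://localhost/sweettemptations/available-sweets"

def redirect_target(url: str) -> str:
    """Return TARGET when the whole URL matches the pattern, else the URL unchanged."""
    return TARGET if SWEETS.fullmatch(url) else url

examples = [
    "http://localhost/sweettemptations/sweets/somethingmore.html",
    "localhost/sweettemptations/sweets/sweet-name",
    "http://localhost/sweettemptations/available-sweets",
]
results = [redirect_target(u) for u in examples]
for u, r in zip(examples, results):
    print(u, "->", r)
```

As written, the pattern requires the slash after sweets, so a bare .../sweets URL with no trailing slash is not matched and would need its own rule.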
Use this regular expression: (?<=://).+

Regex for finding elements without a certain attribute (e.g., "id")

I'm scrubbing through a large number of XML based files in a JSF project, and would like to find certain components that are missing an ID attribute. For example, let's say I want to find all of the <h:inputText /> elements that do not have an id-attribute specified.
I've tried the following in RAD (Eclipse), but something's not quite right because I still get some components that do have a valid ID.
<([hf]|ig):(?!output)\w+\s+(?!\bid\b)[^>]*?\s+(?!\bid\b)[^>]*?>
Not sure if my negative-lookahead is correct or not?
The desired result would be that I would find the following (or similar) in any JSP in the project:
<h:inputText value="test" />
... but not:
<h:inputText id="good_id" value="test" />
I'm just using <h:inputText/> as an example. I was trying to be broader than that, but definitely excluding <h:outputText/>.
Disclaimer:
As others correctly point out, it is best to use a dedicated parser when working with non-regular markup languages such as XML/HTML. There are many ways for a regex solution to fail with either false positives or missed matches.
That said...
This particular problem is a one-shot editing problem and the target text (an open tag) is not a nested structure. Although there are ways for the following regex solution to fail, it should still do a pretty good job.
I don't know Eclipse's regex syntax, but if it provides negative lookahead, the following is a regex solution that will match a list of specific target elements which do not have an ID attribute: (First, presented in PHP/PCRE free-spacing mode commented syntax for readability)
$re_open_tags_with_no_id_attrib = '%
    # Match specific element open tags having no "id" attribute.
    <                    # Literal "<" start of open tag.
    (?:                  # Group of target element names.
      h:inputText        # Either h:inputText element,
    | h:otherTag         # or h:otherTag element,
    | h:anotherTag       # or h:anotherTag element.
    )                    # End group of target element names.
    (?:                  # Zero or more open tag attributes.
      \s+                # Whitespace required before each attribute.
      (?!id\b)           # Assert this attribute not named "id".
      [\w\-.:]+          # Non-"id" attribute name.
      (?:                # Group for optional attribute value.
        \s*=\s*          # Value separated by =, optional ws.
        (?:              # Group of attrib value alternatives.
          "[^"]*"        # Either double quoted value,
        | \'[^\']*\'     # or single quoted value,
        | [\w\-.:]+      # or unquoted value.
        )                # End group of value alternatives.
      )?                 # Attribute value is optional.
    )*                   # Zero or more open tag attributes.
    \s*                  # Optional whitespace before close.
    /?                   # Optional empty tag slash before >.
    >                    # Literal ">" end of open tag.
    %x';
And here is the same regex in bare-bones native format which may be suitable for copy and paste into an Eclipse search box:
<(?:h:inputText|h:otherTag|h:anotherTag)(?:\s+(?!id\b)[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>
Note the group of target element names to be matched at the beginning of this expression. You can add or remove target elements in this ORed list. Note also that this expression is designed to work pretty well for HTML as well as XML: it allows for value-less attributes, unquoted attribute values, and quoted attribute values containing <> angle brackets.
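The one-line pattern can be sanity-checked against the question's examples with a short Python sketch. Python's re module is assumed here (Eclipse's search uses Java regex, which supports the same lookahead syntax); the third sample tag, with the id placed last, is a made-up extra case.

```python
import re

# The bare-bones pattern from above, unchanged.
NO_ID_TAG = re.compile(
    r"""<(?:h:inputText|h:otherTag|h:anotherTag)(?:\s+(?!id\b)[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>"""
)

samples = [
    '<h:inputText value="test" />',                  # no id: should be reported
    '<h:inputText id="good_id" value="test" />',     # has id: should be skipped
    '<h:inputText value="test" id="late_id" />',     # hypothetical: id not first
    '<h:outputText value="test" />',                 # not in the target list
]

# True where the tag is one of the targets and has no id attribute.
missing_id = [NO_ID_TAG.search(s) is not None for s in samples]
print(missing_id)
```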