Regex for finding elements without a certain attribute (e.g., "id") - regex

I'm scrubbing through a large number of XML based files in a JSF project, and would like to find certain components that are missing an ID attribute. For example, let's say I want to find all of the <h:inputText /> elements that do not have an id-attribute specified.
I've tried the following in RAD (Eclipse), but something's not quite right because I still get some components that do have a valid ID.
<([hf]|ig):(?!output)\w+\s+(?!\bid\b)[^>]*?\s+(?!\bid\b)[^>]*?>
Not sure if my negative-lookahead is correct or not?
The desired result would be that I would find the following (or similar) in any JSP in the project:
<h:inputText value="test" />
... but not:
<h:inputText id="good_id" value="test" />
I'm just using <h:inputText/> as an example. I was trying to be broader than that, but definitely excluding <h:outputText/>.

Disclaimer:
As others correctly point out, it is best to use a dedicated parser when working with non-regular markup languages such as XML/HTML. There are many ways for a regex solution to fail with either false positives or missed matches.
That said...
This particular problem is a one-shot editing problem and the target text (an open tag) is not a nested structure. Although there are ways for the following regex solution to fail, it should still do a pretty good job.
I don't know Eclipse's regex syntax, but if it provides negative lookahead, the following is a regex solution that will match a list of specific target elements which do not have an ID attribute: (First, presented in PHP/PCRE free-spacing mode commented syntax for readability)
$re_open_tags_with_no_id_attrib = '%
# Match specific element open tags having no "id" attribute.
< # Literal "<" start of open tag.
(?: # Group of target element names.
h:inputText # Either h:inputText element,
| h:otherTag # or h:otherTag element,
| h:anotherTag # or h:anotherTag element.
) # End group of target element names.
(?: # Zero or more open tag attributes.
\s+ # Whitespace required before each attribute.
(?!id\b) # Assert this attribute not named "id".
[\w\-.:]+ # Non-"id" attribute name.
(?: # Group for optional attribute value.
\s*=\s* # Value separated by =, optional ws.
(?: # Group of attrib value alternatives.
"[^"]*" # Either double quoted value,
| \'[^\']*\' # or single quoted value,
| [\w\-.:]+ # or unquoted value.
) # End group of value alternatives.
)? # Attribute value is optional.
)* # Zero or more open tag attributes.
\s* # Optional whitespace before close.
/? # Optional empty tag slash before >.
> # Literal ">" end of open tag.
%x';
And here is the same regex in bare-bones native format which may be suitable for copy and paste into an Eclipse search box:
<(?:h:inputText|h:otherTag|h:anotherTag)(?:\s+(?!id\b)[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>
Note the group of target element names at the beginning of this expression. You can add or remove target elements in this ORed list. Note also that this expression is designed to work reasonably well for HTML as well as XML. HTML may have value-less attributes, unquoted attribute values, and quoted attribute values containing <> angle brackets, and the attribute sub-pattern allows for all three.
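As a quick sanity check, here is a minimal Python sketch that runs the bare-bones expression against a few sample tags. It assumes Python's re module treats these constructs the same way Eclipse's engine does. The sample tags (including the "late_id" one) and the names NO_ID_TAG and missing_id are made up for illustration.

```python
import re

# Same expression as above; the element names in the alternation are examples.
NO_ID_TAG = re.compile(
    r"""<(?:h:inputText|h:otherTag|h:anotherTag)"""
    r"""(?:\s+(?!id\b)[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)*\s*/?>"""
)

samples = [
    '<h:inputText value="test" />',               # no id: should be reported
    '<h:inputText id="good_id" value="test" />',  # id first: should be skipped
    '<h:inputText value="test" id="late_id" />',  # id later: should be skipped
    '<h:outputText value="test" />',              # not a target element
]

# Keep only the tags that the "no id attribute" expression matches.
missing_id = [tag for tag in samples if NO_ID_TAG.search(tag)]
print(missing_id)
```

The lookahead is checked before every attribute name, which is why an id in any position prevents the match, not only an id in first position.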

Related

Regex To Exclude First Part of String

New at this so thanks in advance for the help.
I'm looking to write a Regex that will match the end of the string but not the beginning and there are some cases where the string is only one character.
Here are the sample strings and I'm trying to match only the items shown, otherwise there is no match.
/en-ca/brand/atf-type-f/ # should match /brand/atf-type-f/
/ # no match
/en-ca # no match
/en-ca/ # no match
/es-xl # no match
/en-gb # no match
/ru-kz/ # no match
/knowledge-centre/sds # should match /knowledge-centre/sds
/en-us/brand/purity-fg # should match /brand/purity-fg
The regex engine I'm using is Google Analytics, and I'm looking to output the Page Path without the country ID and the language ID.
Figured this out.
Using the Advanced Filter within GA I:
1) Used regex with ^(/..-..)?(/)?(.*)
2) Used Output To -> Constructor to put together the groups I wanted. Each () within the GA Output Constructor is numbered, so $A1 picks up the first part and so on. Returning just $A3 gave me the path. I had to add the / back in at the beginning, so the output statement became /$A3
Hope this helps someone else.
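For anyone who wants to try the expression outside GA, here is a rough Python sketch of the same idea. The function name strip_locale is made up, and Python's re module is assumed to behave like GA's engine for this pattern. Note that locale-only paths such as /en-ca/ do not fail to match; they simply reduce to "/".

```python
import re

# Group 1: optional "/xx-yy" locale prefix, group 2: optional slash,
# group 3: the rest of the path (what GA exposes as $A3).
LOCALE_PREFIX = re.compile(r"^(/..-..)?(/)?(.*)")

def strip_locale(page_path: str) -> str:
    """Return the page path with a leading /xx-yy locale segment removed."""
    match = LOCALE_PREFIX.match(page_path)
    # Every part of the pattern is optional, so it always matches.
    return "/" + match.group(3)

for path in ["/en-ca/brand/atf-type-f/", "/knowledge-centre/sds", "/en-ca/"]:
    print(path, "->", strip_locale(path))
```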

Capturing group is too greedy

I have been working hard to get a regular expression to work for me, but I'm stuck on the last part. My goal is to remove an xml element when it is contained within specific parent elements. The example xml looks like so:
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png">
<ri:page ri:content-title="Banana Farts" /> /* REMOVE THIS */
</ri:attachment>
</ac:image>
The expression I have written is:
(<ac:image.*?>)(<ri:attachment.*?)(<ri:page.*? />)(</ri:attachment></ac:image>)
In more readable format, I am searching on four groups
(<ac:image.*?>) //Find open image tag
(<ri:attachment.*?) //Find open attachment tag
(<ri:page.*? />) //Find the page tag
(</ri:attachment></ac:image>) //Find close image and attachment tags
And this basically works because I can remove the page element in notepad++ with:
\1\2\4
My issue is that the search is too greedy. In an example like below it grabs everything from start to finish, when really only the second image tag is a valid find.
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png" />
</ac:image>
<ac:image ac:width="500">
<ri:attachment ri:filename="image2013-10-31 11:21:16.png">
<ri:page ri:content-title="Employee Portal Editor" />
</ri:attachment>
</ac:image>
Can anyone help me finish this up? I thought all I had to do was add ? to make the closing tag group not greedy, but it failed to work.
Keep in mind that a regex engine will try everything possible to make the pattern succeed. Since you use several .*? in your pattern, you give the regex engine a lot of flexibility to do so: a lazy .*? will still expand across tag boundaries if that is the only way to reach a match. The pattern must be more restrictive.
To do that, you can replace all the .*? with [^>]*
Don't forget to add optional white-spaces between each tag \s* in the pattern.
Example:
(<ac:image[^>]*> \s* <ri:attachment[^>]*> ) # group 1
\s* <ri:page[^>]*/> \s* # what you need to remove
(</ri:attachment> \s* </ac:image>) # group 2
replacement: $1$2
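To see the stricter pattern in action, here is a small Python sketch using the example from the question. It assumes free-spacing mode (re.VERBOSE) and uses \1\2 where Notepad++ would take $1$2. The shortened file name a.png and the variable names are made up for illustration.

```python
import re

PAGE_IN_ATTACHMENT = re.compile(
    r"""
    (<ac:image[^>]*> \s* <ri:attachment[^>]*>)   # group 1
    \s* <ri:page[^>]*/> \s*                      # what you need to remove
    (</ri:attachment> \s* </ac:image>)           # group 2
    """,
    re.VERBOSE,
)

xml = """<ac:image ac:width="500">
<ri:attachment ri:filename="a.png" />
</ac:image>
<ac:image ac:width="500">
<ri:attachment ri:filename="a.png">
<ri:page ri:content-title="Employee Portal Editor" />
</ri:attachment>
</ac:image>"""

# Keep both groups and drop the ri:page element between them.
cleaned = PAGE_IN_ATTACHMENT.sub(r"\1\2", xml)
print(cleaned)
```

Because [^>]* cannot run past the end of a tag, the first image block can no longer be stretched to reach the ri:page element in the second one.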

Removing commas and empty tags from a string using regex

I am trying to filter out spam before being posted using a few routines and external services (akismet) but they all seem to fail when pushing in a comma delimited word or a word formed with empty tags. Eg
b[u][/u]u[u][/u]y[i][/i]m[b][/b] e <-> buyme
b,u,y,m,e <-> buyme
Does anyone know of a good ColdFusion regex to strip out this sort of behavior before I post it to Akismet for processing?
Firstly: Have you checked whether Akismet is already doing this?
I would very much suspect it already does all this processing (and more), so you don't actually need to.
Anyway, assuming this is bbcode, and thus the relevant tags will be for bold/italic/underline, you can replace them with:
TextForAkismet = rereplace( TextForAkismet , '\[([biu])\]\[/\1\]' , '' , 'all' )
If there are other empty tags you want to remove, simply update the captured group (the bit in parentheses) as appropriate. To also cater for tags that have attributes (but are still empty), a quick and dirty way is to use [^\]]* after the tag name (outside the captured group).
'\[([biu]|img|url)[^\]]*\]\[/\1\]'
Depending on the dialect of bbcode you're working with, you may need to handle quoted brackets which would need a more complex expression.
To remove commas that appear between letters, use:
TextForAkismet = rereplace( TextForAkismet , '\b,\b' , '' , 'all' )
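(Where \b matches any position between a word character and a non-word character, so this only removes commas that have a letter, digit, or underscore directly on both sides.)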
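If you want to experiment with the two expressions outside ColdFusion, here is a rough Python equivalent. It assumes re.sub handles the backreference and \b the same way rereplace does, and the function name normalise_for_spam_check is made up.

```python
import re

def normalise_for_spam_check(text: str) -> str:
    """Strip empty b/i/u bbcode pairs, then commas wedged between word characters."""
    # \1 refers back to the tag letter, so [u][/u] matches but [u][/b] does not.
    text = re.sub(r"\[([biu])\]\[/\1\]", "", text)
    # A comma with a word character directly on both sides.
    text = re.sub(r"\b,\b", "", text)
    return text

print(normalise_for_spam_check("b[u][/u]u[u][/u]y[i][/i]m[b][/b] e"))
print(normalise_for_spam_check("b,u,y,m,e"))
```

Ordinary punctuation such as "Hello, world" is left alone, because the comma there is followed by a space rather than a word character.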

Regular Expression break down URL into parts

I've just recently started learning regex, so I'm not sure yet about a couple of aspects of the whole thing.
Right now my web page reads in the URL breaks it up into parts and only uses certain parts for processing:
E.g. 1) http://mycontoso.com/products/luggage/selloBag
E.g. 2) http://mycontoso.com/products/luggage/selloBag.sf404.aspx
For some reason Sitefinity is giving us both possibilities, which is fine, but what I need from this is only the actual product details as in "luggage/selloBag"
My current Regex expression is: "(.*)(map-search)(\/)(.*)(\.sf404\.aspx)", I combine this with a replace statement and extract the contents of group 4 (or $4), which is fine, but it doesn't work for example 2.
So the question is: Is it possible to match 2 possibilities with regular expressions where a part of a string might or might not be there and then still reference a group whose value you actually want to use?
RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme = $2
# authority = $4
# path = $5
# query = $7
# fragment = $9
Here is an enhanced (and commented) regex (in Python syntax) which utilizes named capture groups:
import re

re_3986_enhanced = re.compile(r"""
# Parse and capture RFC-3986 Generic URI components.
^ # anchor to beginning of string
(?: (?P<scheme> [^:/?#\s]+): )? # capture optional scheme
(?://(?P<authority> [^/?#\s]*) )? # capture optional authority
(?P<path> [^?#\s]*) # capture required path
(?:\?(?P<query> [^#\s]*) )? # capture optional query
(?:\#(?P<fragment> [^\s]*) )? # capture optional fragment
$ # anchor to end of string
""", re.MULTILINE | re.VERBOSE)
For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation
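As a usage sketch, the named groups can be read back with groupdict(). The pattern below is the enhanced regex from above with the comments trimmed. The query string and fragment on the test URL are made up so that every component is populated.

```python
import re

re_3986_enhanced = re.compile(r"""
    ^
    (?: (?P<scheme> [^:/?#\s]+): )?      # optional scheme
    (?://(?P<authority> [^/?#\s]*) )?    # optional authority
    (?P<path> [^?#\s]*)                  # required path
    (?:\?(?P<query> [^#\s]*) )?          # optional query
    (?:\#(?P<fragment> [^\s]*) )?        # optional fragment
    $
    """, re.MULTILINE | re.VERBOSE)

url = "http://mycontoso.com/products/luggage/selloBag.sf404.aspx?x=1#top"
parts = re_3986_enhanced.match(url).groupdict()

# Print each named component on its own line.
for name, value in parts.items():
    print(f"{name:10} {value}")
```

Components that are absent from the URL come back as None rather than as empty strings.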
Depends on your regex implementation, but most support a syntax like
(\.sf404\.aspx|)
Assuming that's your group 4 (i.e. zero-indexed groups). The | lists two alternatives, one of which is the empty string.
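To make the optional-part idea concrete, here is a small Python sketch. It uses the equivalent (?:...)? form instead of an empty alternative, and it anchors on /products/ because that is what the example URLs contain (the regex in the question uses map-search instead). The names PRODUCT_PATH and product_details are made up.

```python
import re

# Group 1 is lazy so that the optional suffix, when present, is left out of it.
PRODUCT_PATH = re.compile(r"^.*/products/(.*?)(?:\.sf404\.aspx)?$")

def product_details(url: str):
    """Return the part after /products/, without a trailing .sf404.aspx."""
    match = PRODUCT_PATH.match(url)
    return match.group(1) if match else None

print(product_details("http://mycontoso.com/products/luggage/selloBag"))
print(product_details("http://mycontoso.com/products/luggage/selloBag.sf404.aspx"))
```

Either way, the group you care about keeps the same number whether or not the suffix is there.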
You don't say if you're doing this in javascript, but if you are, the parseUri lib written by Steven Levithan does a pretty damn good job at parsing urls. You can get it from various places, including here (click on the "Source Code" tab) and here.

parse url from string in coldfusion

I need to parse all URLs from a paragraph (string),
e.g.
"check out this site google.com and don't forget to see this too bing.com/maps"
It should return "google.com" and "bing.com/maps".
I'm currently using this, and it's not quite right:
reMatch("(^|\s)[^\s#]+\.[^\s#\?\/]{2,5}((\?|\/)\S*)?",mystring)
thanks
You need to define more clearly what you consider a URL
For example, I might use something such as this:
(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?
(use with reMatchNoCase or plonk (?i) at front to ignore case)
Which specifically only allows alphanumerics, underscore, and hyphen in domain and path parts, requires the TLD to be letters only, and only looks for numeric ports.
It might be that this is good enough, or you may need something that looks for more characters, or perhaps you want to trim things like quotes, brackets, etc. off the end of the URL, or whatever. Whether you'd like to err towards missing URLs or detecting non-URLs depends on the context of what you're doing.
(I'd probably go for the latter, then potentially run a secondary filter to verify if something is a URL, but that takes more work, and may not be necessary for what you're doing.)
Anyhow, the explanation of the above expression is below, hopefully with clear comments to help it make sense. :)
(Note that all groups are non-capturing (?:...) since we don't need the indiv parts.)
# PROTOCOL
(?:https?:)? # optional group of "http:" or "https:"
# SERVER NAME / DOMAIN
(?://)? # optional double forward slash
(?:[\w-]+\.)+ # one or more "word characters" or hyphens, followed by a literal .
# grouped together and repeated one or more times
[a-z]{2,6} # as many as 6 alphas, but at least 2
# PORT NUMBER
(?::\d+)? # an optional group made up of : and one or more digits
# PATH INFO
(?:/[\w.,-]+)* # a forward slash then multiple alphanumeric, underscores, or hyphens
# or dots or commas (add any other characters as required)
# in a group that might occur multiple times (or not at all)
# QUERY STRING
(?:\?\S+)? # an optional group containing ? then any non-whitespace
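For a quick test of the assembled expression, here is a minimal Python sketch using the sample sentence from the question. It assumes Python's re module as a stand-in for reMatchNoCase, with re.IGNORECASE playing the role of (?i). The names URL_PATTERN and found are made up.

```python
import re

URL_PATTERN = re.compile(
    r"(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}"   # protocol, domain, TLD
    r"(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?",           # port, path, query string
    re.IGNORECASE,
)

text = "check out this site google.com and don't forget to see this too bing.com/maps"

# All groups are non-capturing, so findall returns the full matches.
found = URL_PATTERN.findall(text)
print(found)
```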
Update:
To prevent the end of email addresses being matched, we need to use a lookbehind, to ensure that prior to the URL we don't have an @ sign (or anything else unwanted) but without actually including that prior character in the match.
CF's regex is Apache ORO which doesn't support lookbehinds, but we can use the java.util.regex nice and easily with a component I have created which does support lookbehinds.
Using that is as simple as:
<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
...
<cfset Urls = jrex.match( regex , input ) />
After the createObject, it should basically be like using the built-in re~ stuff, but with the slight syntax difference, and the different regex engine under the hood.
(If you have any problems or questions with the component, let me know.)
So, on to your excluding emails from URL matching problem:
We can either do a (?<=positive) or (?<!negative) lookbehind, depending on if we want to say "we must have this" or "we must not have this", like so:
(?<=\s) # there must be whitespace before the current position
(?<!@) # there must NOT be an @ before current position
For this URL example, I would expand either of those examples to:
(?<=\s|^) # look for whitespace OR start of string
or
(?<![@\w/]) # ensure there is not an @ or / or word character.
Both will work (and can be expanded with more chars), but in different ways, so it simply depends which method you want to do it with.
Put whichever one you like at the start of your expression, and it should no longer match the end of abcd@gmail.com, unless I've screwed something up. :)
Update 2:
Here is some sample code which will exclude any email addresses from the match:
<cfset jrex = createObject('component','jre-utils').init('CASE_INSENSITIVE') />
<cfsavecontent variable="SampleInput">
check out this site google.com and don't forget to see this too bing.com/maps
this is an email@somewhere.com which should not be matched
</cfsavecontent>
<cfset FindUrlRegex = '(?<=\s|^)(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?' />
<cfset MatchedUrls = jrex.match( FindUrlRegex , SampleInput ) />
<cfdump var="#MatchedUrls#" />
Make sure you have downloaded the jre-utils.cfc from here and put in an appropriate place (e.g. same directory as script running this code).
This step is required because the (?<=...) construct does not work in CF regular expressions.
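For comparison, here is a rough Python sketch of the same email-excluding match, on the assumption that the character to exclude is the @ of an email address. Python's re module only accepts fixed-width lookbehinds, so (?<=\s|^) is rejected there and the sketch uses the negative (?<![@\w/]) form instead. The names URL_NOT_AFTER_AT, sample_input and matched_urls are made up.

```python
import re

URL_NOT_AFTER_AT = re.compile(
    r"(?<![@\w/])"                                  # not right after @, / or a word character
    r"(?:https?:)?(?://)?(?:[\w-]+\.)+[a-z]{2,6}"   # protocol, domain, TLD
    r"(?::\d+)?(?:/[\w.,-]+)*(?:\?\S+)?",           # port, path, query string
    re.IGNORECASE,
)

sample_input = """
check out this site google.com and don't forget to see this too bing.com/maps
this is an email@somewhere.com which should not be matched
"""

matched_urls = URL_NOT_AFTER_AT.findall(sample_input)
print(matched_urls)
```

Including \w in the lookbehind matters: without it the engine could skip the first letter after the @ and still match the rest of the domain.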