Regular expression for XML - regex

I'm attempting to build a regular expression that will match against the contents of an XML element containing some un-encoded data. Eg:
<myElement><![CDATA[<p>The <a href="http://blah"> draft </p>]]></myElement>
Usually in this circumstance I'd use
[^<]*
to match everything up to the less than sign but this isn't working in this case. I've also tried this unsuccessfully:
[^(</myElement>)]*
I'm using Groovy, i.e. Java.

Please don't do this, but you're probably looking for:
<myElement>(.*?)</myElement>
This won't work if <myElement> (or the closing tag) can appear in the CDATA. It won't work if the XML is malformed. It also won't work with nested <myElement>s. And the list goes on...
The proper solution is to use a real XML parser.
Your [^(</myElement>)]* regex was saying: match any number of characters that are not in the set (, <, /, m, etc., which is clearly not what you intended. You cannot place a group within a character class in order for it to be treated atomically -- the characters will always be treated as a set (with ( and ) being literal characters, too).

if you are doing it on a line by line basis, this will match the inside if your example:
>(.*)</
returns: <![CDATA[<p>The <a href="http://blah"> draft </p>]]>
Probably use it something like this:
subjectString = '<myElement><![CDATA[<p>The <a href="http://blah"> draft </p>]]></myElement>';
Matcher regexMatcher = subjectString =~ ">(.*)</"
if (regexMatcher.find()) {
String ResultString = regexMatcher.group();
}

Related

Regular Expression - Starting and ending with, and contains specific string in the middle

I would like to generate a regex with the following condition:
The string "EVENT" is contained within a xml tag called "SHEM-HAKOVETZ".
For example, the following string should be a match:
<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
I think you want something like this ^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Regular expression
^<SHEM-HAKOVETZ>.*EVENTS.*<\/SHEM-HAKOVETZ>$
Parts of the regular expression
^ From the beginning of the line
<SHEM-HAKOVETZ> Starting tag
.* Any character - zero or more
EVENT Middle part
<\/SHEM-HAKOVETZ>$ Ending part of the match
Here is the working regex.
If you want to match this line, you could use this regex:
<SHEM-HAKOVETZ>*EVENTS.*(?=<\/SHEM-HAKOVETZ>)
However, I would not recommend using regex XML-based data, because there may be problems with whitespace handling in XML (see this article for more information). I would suggest using an actual XML parser (and then applying the reg to be sure about your results.
Here is a solution to only match the "value" part ignoring the XML tags:
(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=<\/SHEM-HAKOVETZ>)
You can check it out in action at: https://regex101.com/r/4XiRch/1
It works with Lookbehind and Lookahead to make sure it will only match if the tags are correct, but for further coding will only match the content.

Parsing variables within a string using a Regular Expression

I've got a bit of a problem with regular expressions with ColdFusion.
I have a string:
Hi my name is {firstname}. and i live in {towncity} my email address is {email}
What I would like to know is how would I go about finding all strings, within my string, that are encased within a set of {} brackets? I would like to split all the matching strings into an array so I can use the results of query data.
Also is this a commonly used pattern for processing strings within matching strings for merging variable data ?
Any help greatly appreciated.
Simple Answer
To find all the brace-encased strings, you can use rematch and the simple expression \{[^{}]+\}
Explanation
The backslashes \ before each brace are to escape them, and have them act as literal braces (they carry special meaning otherwise).
The [^...] is a negative character class, saying match any single char that is NOT one of those contained within, and the greedy + quantifier tells it to match as many as possible, but at least one, from the preceding item.
Thus using [^{}]+ between the braces means it will not match nested or unmatched braces. (Whilst using \{.*?\} could match two opening braces. Note: the *? is a lazy quantifier, it matches nothing (if possible), but as many as required.)
Extended Answer
However, since you say that the results come from a query, a way to only match the values you're dealing with is to use the query's ColumnList to form an expression:
`\{(#ListChangeDelims(QueryName.ColumnList,'|')#)\}`
This changes ColumnList into a pipe-delimited list - a set of alternatives, grouped by the parentheses - i.e. the generated pattern will be like:
\{(first_name|towncity|email)\}
(with the contents of that group going into capture group 1).
To actually populate the text (rather than just matching) you could do something similar, except there is no need for a regex here, just a straight replace whilst looping through columns:
<cfloop index="CurColumn" list=#QueryName.ColumnList#>
<cfset text = replace( text , '{#CurColumn#}' , QueryName[CurColumn][CurrentRow] , 'all' ) />
</cfloop>
(Since this is a standard replace, there's no need to escape the braces with backslashes; they have no special meaning here.)
Use the reMatch(reg_expression, string_to_search) function.
The details on Regular Expressions in Coldfusion 10 are here. (I believe the regexp in CF8 would be roughly the same.)
Use the following code.
<cfset str = "Hi my name is {firstname}. And I live in {towncity} my email address is {email}.">
<cfoutput>Search string: <b>#str#</b><br />Search result:<br /></cfoutput>
<cfset ret = reMatch("\{[\w\s\(\)\+\.#-]+\}", str)>
<cfdump var ="#ret#">
This returns an array with the following entries.
{firstname}
{towncity}
{email}
The [] brackets in CF regular expressions define a character set to match a single character. You put + after the brackets to match one or more characters from the character set defined inside the []. For example, to match one or more upper case letters you could write [A-Z]+.
As detailed in the link above, CF defines shortcuts to match various characters. The ones I used in the code are: \w to match an alpha-numeric character or an underscore, \s to match a whitespace character (including space, tab, newline, etc.).
To match the following special characters +*?.[^$({|\ you escape them by writing backslash \ before them.
An exception to this is the dash - character, which cannot be escaped with a backslash. So, to use it as a literal simply place it at the very end of the character set, like I did above.
Using the above regular expression you can extract characters from the following string, for example.
<cfset str = "Hi my name is { John Galt}. And I live in {St. Peters-burg } my email address is {john#exam_ple.com}.">
The result would be an array with the following entries.
{ John Galt}
{St. Peters-burg }
{john#exam_ple.com}
There may be much better ways to do this, but using something like rematch( '{.*?}', yourstring ) would give you an array of all the matches.
For future reference, I did this with the excellent RegExr, a really nice online regex checker. Full disclosure, it's not specifically for ColdFusion, but it's a great way to test things out.

Need to extract text from within first curly brackets

I have strings that look like this
{/CSDC} CHOC SHELL DIP COLOR {17}
I need to extract the value in the first swirly brackets. In the above example it would be
/CSDC
So far i have this code which is not working
Dim matchCode = Regex.Matches(txtItems.Text, "/\{(.+?)\}/")
Dim itemCode As String
If matchCode.Count > 0 Then
itemCode = matchCode(0).Value
End If
I think the main issue here is that you are confusing your regular expression syntax between different languages.
In languages like Javascript, Perl, Ruby and others, you create a regular expression object by using the /regex/ notation.
In .NET, when you instantiate a Regex object, you pass it a string of the regular expression, which is delimited by quotes, not slashes. So it is of the form "regex".
So try removing the leading and trailing / from your string and see how you go.
This may not be the whole problem, but it is at least part of it.
Are you getting the whole string instead of just the 1st value? Regular expressions are greedy by default so .Net is trying to grab the largest matching string.
Try this:
Dim matchCode = Regex.Matches(txtItems.Text, "\{[^}]*\}")
Dim itemCode As String
If matchCode.Count > 0 Then
itemCode = matchCode(0).Groups(0).Value
End If
Edited: I've tried this in Linqpad and it worked.
It appears you are using a capture group.. so try matchCode(0).Groups(0).Value
Also, remove the /\ from the beginning of the pattern and remove the trailing /

Regexp Replace (as3) - Use text to find but not to replace

Having some trouble with regexp. My XML file loaded to actionscript removes all spaces (automatically trims the text). So I want to replace all SPACE with a word so that I can fix that later on in my own parsing.
Here's examples of how the tags I want to adjust.
<w:t> </w:t>
<w:t> Test</w:t>
<w:t>Test </w:t>
This is the result I want.
<w:t>%SPACE%</w:t>
<w:t>%SPACE%Test</w:t>
<w:t>Test%SPACE%</w:t>
The closest result I got is <w:t>\s|\s</w:t>
Biggest problem is that it changes all spaces in the XML file that corrupts everything. Will only change inside w:t nodes but not destroy the text.
When parsing XML using the standard XML class in ActionScript you can specify to not ignore whitespace by setting the ignoreWhiteSpace property to false. It is set to true by default. This will ensure that white space in XML text nodes is preserved. You can then do whatever you want with it.
XML.ignoreWhiteSpace = false
/* parse your XML here */
That way you don't have to muck around with regular expressions and can use the standard XML ActionScript parsing.
var reg1 : RegExp = /((?:<w:t>|\G)[^<\s]*+)\s/g;
data = data.replace(reg1, "$1%SPACE%");
(?:<w:t>|\G) means every match starts at a <w:t> tag, or immediately after the previous match. Since [^<\s] can't match the closing </w:t> tag (or any other tag), every match is guaranteed to be inside a <w:t> element.
To do this properly, you would need to deal with some more questions, like:
\s matches several other kinds of whitespace, not just ' '. Do you want to replace any whitespace character with %SPACE%? Or do you know that ' ' will be the only kind of whitespace in those elements?
Will there be other elements inside the <w:t> elements (for example, <w:t> test <xyz> test </xyz> </w:t>)? If so, the regex becomes more complicated, but it's still doable.
I'm not set up to test ActionScript, but here's a demo in PHP, which uses the PCRE library under the hood, like AS3:
test it on ideone.com
EDIT: In addition to matching where the last match left off, \G matches the beginning of the input, just like \A. That's not a problem with the regex given here, but in the ideone demo it is. That regex should be
((?:<w:t>|\G(?!\A))(?:[^<\s]++|<(?!/w:t>))*+)\s
Made a workaround that isn't so nice. But well, problem is when you work against the clock.
I run the replace 3 times instead.
var reg1 : RegExp = /<w:t>\s/gm;
data = data.replace(reg1, "<w:t>%DEADSPACE%");
var reg2 :RegExp = /\s<\/w:t>/gm;
data = data.replace(reg2, "%DEADSPACE%</w:t>");
var reg3 :RegExp = /<w:t>\s<\/w:t>/gm;
data = data.replace(reg3, "<w:t>%DEADSPACE%</w:t>");
RegExp, what is it good for. Absolutly nothing (singing) ;)
there's also another way

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.