Understanding these Regular Expressions - regex

There is a variable being set as follows (through custom tag invocation)
<cfset str = Trim( THISTAG.GeneratedContent ) />
The contents of THISTAG.GeneratedContent look like:
FNAME|MNAME|LNAME Test|Test|Test
The code I am having trouble understanding is as follows:
<cfset str = str.ReplaceAll(
"(?m)^[\t ]+|[\t ]+$",
""
) />
<cfset arrRows = str.Split( "[\r\n]+" ) />
The above line of code should generate an array with these contents:
arrRows[1] = FNAME|MNAME|LNAME
arrRows[2] = Test|Test|Test
But dumping the array shows the following output:
FNAME|MNAME|LNAME Test|Test|Test
I do not understand what both regular expressions are trying to achieve.

This one...
<cfset str = str.ReplaceAll(
"(?m)^[\t ]+|[\t ]+$",
""
) />
..is removing any tabs/spaces that are at the beginning or end of lines. The (?m) turns on multiline mode which causes ^ to match "start of line" (as opposed to its usual "start of content"), and similarly $ means "end of line" (rather than "end of content") in this mode.
This one...
<cfset arrRows = str.Split( "[\r\n]+" ) />
...is converting lines to an array, by splitting on any combination of consecutive carriage returns and/or newline characters.
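Since ColdFusion strings are java.lang.String objects, both calls can be tried out in plain Java. What follows is only a minimal sketch: the class name, method name and sample input are made up for illustration and are not part of the original tag.

```java
import java.util.Arrays;

class SplitRowsSketch {
    // Mirrors the two CFML calls: strip tabs/spaces at line edges, then split on line breaks.
    static String[] toRows(String content) {
        String str = content.trim();
        // (?m) makes ^ and $ match at the start/end of every line, not only of the whole input.
        str = str.replaceAll("(?m)^[\\t ]+|[\\t ]+$", "");
        // Any run of carriage returns and/or newlines counts as one separator.
        return str.split("[\\r\\n]+");
    }

    public static void main(String[] args) {
        String sample = "  FNAME|MNAME|LNAME \t\r\n\t Test|Test|Test  \n";
        System.out.println(Arrays.toString(toRows(sample)));
    }
}
```

If the content contains no line break at all, the split has nothing to split on and returns a single element, which would be consistent with the dump shown in the question.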
Bonus Info
You can actually combine these two regexes into a single one, like so:
<cfset arrRows = str.split( '\s*\n\s*' ) />
The \s will match any whitespace character - i.e. [\r\n\t ] plus form feed and vertical tab - and thus this combines the removal of spaces and tabs with turning it into an array.
(Note that since it works by looking for newlines, the trim on GeneratedContent is necessary for any preceding/trailing whitespace to be removed.)
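The combined pattern can be checked the same way in plain Java. This is a sketch under the assumption that the input uses \n or \r\n line endings; the class and method names are hypothetical.

```java
import java.util.Arrays;

class CombinedSplitSketch {
    // One regex swallows the whitespace on both sides of each newline while splitting.
    static String[] toRows(String content) {
        // trim() is still needed: the pattern only touches whitespace adjacent to a newline.
        return content.trim().split("\\s*\\n\\s*");
    }

    public static void main(String[] args) {
        String sample = "  FNAME|MNAME|LNAME \t\r\n\r\n\t Test|Test|Test  ";
        System.out.println(Arrays.toString(toRows(sample)));
    }
}
```

One difference from the two-step version: because the pattern requires a \n, content separated only by bare carriage returns is not split.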

Related

How to replace ' character in coldfusion

That's the problem I am having: the system doesn't let me do this kind of replacement, as ' is used to delimit the string to replace.
I think your string is escaped, so ColdFusion is not finding the characters you want to replace.
You can try something like:
<cfset StrUtils = createObject("java", "org.apache.commons.lang.StringEscapeUtils") />
<cfset result = Replace( StrUtils.unescapeHTML( your_str ), "'", "*", "ALL") />

Using IsValid function to validate text

I am using IsValid (here is the documentation). Below is the code where I am trying to validate that the text field contains only text and spaces with ColdFusion.
This doesn't work. What am I missing here, or is there any other function available for easy use? It should allow only alphabetical characters and spaces.
<cfif isdefined("Form.txtname")
and Form.txtname eq ""
or Form.txtname eq "Enter your name"
or FindNoCase("http://",Form.txtname) neq 0
or IsValid("regex", Form.txtname, "[A-Z][a-z] +") eq 1>
If you want to validate only alphabetical text and spaces, your regex should be
^[a-zA-Z ]*$
The * will allow an empty textfield (so no need for eq "" anymore).
^ and $ are anchors that match the beginning and the end of the string respectively. They make sure there's only what you want in the textfield.
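To see what the two patterns actually accept, here is a small plain-Java sketch (ColdFusion runs on the JVM, but IsValid itself is not used here, and the class and method names are invented for the example). It checks the anchored pattern from this answer and, for contrast, what the pattern from the question describes when it has to cover the whole string.

```java
import java.util.regex.Pattern;

class NameCheckSketch {
    // Anchored: from start to end, only letters and spaces (possibly none).
    private static final Pattern LETTERS_AND_SPACES = Pattern.compile("^[a-zA-Z ]*$");

    static boolean isLettersAndSpaces(String s) {
        return LETTERS_AND_SPACES.matcher(s).find();
    }

    // The question's pattern: one uppercase letter, one lowercase letter, then one or more spaces.
    static boolean matchesQuestionPattern(String s) {
        return s.matches("[A-Z][a-z] +");
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndSpaces("John Smith"));
        System.out.println(isLettersAndSpaces("http://example.com"));
        System.out.println(matchesQuestionPattern("John Smith"));
    }
}
```

The question's pattern only describes strings such as "Jo " (two letters and trailing spaces), which is why it cannot serve as a general "letters and spaces" check.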

Regex to remove last letter of each word in list

I have a list that I created in ColdFusion. Let's use the following list as an example:
<cfset arguments.tags = "battlefieldx, testx, wonderful, ererex">
What I would like to do is remove the "x" from the words that have an x at the end and keep the words in the list. Order doesn't matter. A regex would be fine or looping with coldfusion would be okay too.
Removing x from end of each list element...
To remove all x characters that precede a comma or the end of string, do:
rereplace( arguments.tags , "x(?=,|$)" , "" , "all" )
The (?= ) part here is a lookahead - it matches the position of its contents, but does not include them in what is replaced. The | is alternation - it'll try to match a literal , and if that fails it'll try to match the end of the string ($).
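The effect of the lookahead is easiest to see next to a version that consumes the comma. The sketch below uses Java's String.replaceAll rather than rereplace (the lookahead syntax is the same); the class and method names are made up for the example.

```java
class TrailingXSketch {
    // Lookahead: the comma (or end of string) is checked but not removed.
    static String stripTrailingX(String list) {
        return list.replaceAll("x(?=,|$)", "");
    }

    // Without lookahead the comma is part of the match and gets deleted too.
    static String stripAndEatComma(String list) {
        return list.replaceAll("x(,|$)", "");
    }

    public static void main(String[] args) {
        String tags = "battlefieldx, testx, wonderful, ererex";
        System.out.println(stripTrailingX(tags));
        System.out.println(stripAndEatComma(tags));
    }
}
```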
If you don't want to remove a lone x from, e.g. "x,marks,the,spot"...
If you want to make sure that x is at the end of a word (i.e. is not alone), you can use a non-word boundary check:
rereplace( arguments.tags , "\Bx(?=,|$)" , "" , "all" )
The \B will not match if there isn't a [a-zA-Z0-9_] before the x - for more complex/accurate rules on what constitutes "end of a word", you would need a lookbehind, which can't be done with rereplace, but is still easy enough by doing:
arguments.tags.replaceAll("(?<=\S)x(?=,|$)" , "" )
(That looks for a single non-whitespace character before the x to consider it part of a word, but you can put any limited-width expression within the lookbehind.)
Obviously, to do any letter, switch the x with [a-zA-Z] or whatever is appropriate.
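The two "end of a word" variants can be compared side by side in plain Java. This is an illustrative sketch with invented class and method names, and the inputs are chosen to show where the variants differ.

```java
class WordEndXSketch {
    // \B: there must be a word character directly before the x.
    static String withNonWordBoundary(String list) {
        return list.replaceAll("\\Bx(?=,|$)", "");
    }

    // Lookbehind: there must be any non-whitespace character directly before the x.
    static String withLookbehind(String list) {
        return list.replaceAll("(?<=\\S)x(?=,|$)", "");
    }

    public static void main(String[] args) {
        System.out.println(withNonWordBoundary("x,marks,the,spot"));
        System.out.println(withLookbehind("x,marks,the,spot"));
        System.out.println(withNonWordBoundary("marks,x"));
        System.out.println(withLookbehind("marks,x"));
    }
}
```

Both leave the leading lone x alone. Note, though, that in a list without spaces after the commas, the comma itself is a non-whitespace character, so the \S lookbehind still strips a lone x that is not the first element; a stricter lookbehind would be needed for that case.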
The regex to grab the 'x' from the end of a word is pretty straightforward. Supposing you have a given element as a string, the regex you need is simply:
REReplace(myString, "x$", "")
This matches an x at the end of the given string and replaces it with an empty string.
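To do this for each substring in a comma-delimited list, capture the delimiter and put it back, so that an x at the very end is not replaced with a stray comma:
REReplace(myString, "x(,|$)", "\1", "ALL")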
REReplace(myString, "x$", "")
The $ symbol is going to be used to detect the end of the string. Thus detecting an 'x' at the end of your string. The empty quotes will replace it with nothing, thus removing the 'x'.
This has already been answered, but thought I'd post a ColdFusion only solution since you said you could use either. (The RegEx is obviously much easier, but this will work too)
<cfset arguments.tags = "battlefieldx, testx, wonderful, ererex">
<cfset temparray = []>
<cfloop list="#arguments.tags#" index="i">
<cfif right(i,1) EQ 'X'>
<cfset arrayappend(temparray,left(i,len(i) - 1))>
<cfelse>
<cfset arrayappend(temparray,i)>
</cfif>
</cfloop>
<cfset arguments.tags = arraytolist(temparray)>
If you have ColdFusion 9+ or Railo you can simplify the loop using a ternary operator
<cfloop list="#arguments.tags#" index="i">
<cfset arrayappend(temparray, right(i,1) EQ 'X' ? left(i,len(i) - 1) : i)>
</cfloop>
You could also convert arguments.tags to an array and loop that way
<cfloop array="#listtoarray(arguments.tags)#" index="i">
<cfset arrayappend(temparray, right(i,1) EQ 'X' ? left(i,len(i) - 1) : i)>
</cfloop>

Coldfusion ReReplace "&" but not htmlspecialchars

I need to replace all & with &amp; in a string like this:
&Uuml;bung 1: &Uuml; & &Auml;
or, as rendered in HTML:
Übung 1: Ü & Ä
As you can see, there are htmlspecialchars in the string (but the & is not written as &amp;), so I need to exclude them from my replace. I'm not so familiar with regular expressions. All I need is an expression that does the following:
Search for an & that is either followed by a space, or is not followed by a run of non-space characters ending with a ;, then replace it with &amp;.
I tried something like this:
<cfset data = ReReplace(data, "&[ ]|[^(?*^( ));]", "&amp;", "ALL") />
but that replaces every char with &amp;... ^^'
Sorry, I really don't get those regex things.
Problem with existing attempt
The reason your attempted pattern &[ ]|[^(?*^( ));] is failing is primarily because you have a | but no bounding container - this means you are replacing &[ ] OR [^(?*^( ));] - and that latter will match most things - you are also misunderstanding how character classes work.
Inside [..] (a character class) there are a few simple rules:
if it starts with a ^ it is negated, otherwise the ^ is literal.
if there is a hyphen it is treated as a range (e.g. a-z or 1-5 )
if there is a backslash, it either marks a shorthand class (e.g. \w), or escapes the following character (inside a char class this is only required for [ ] ^ - \).
you are only matching a single character (subject to any qualifiers); there is no ordering/sequence inside the class, and duplicates of the same character are ignored.
Also, you don't need to put a space inside a character class - a literal space works fine (unless you are in free-spacing comment mode, which needs to be explicitly enabled).
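The rules above can be observed directly. The sketch below runs the attempted pattern through Java's regex engine, which treats this particular character class the same way; the class and method names are hypothetical.

```java
class CharClassSketch {
    // Top-level alternation: "& followed by a space" OR "any ONE character
    // that is not one of ( ? * ^ space ) ;".
    static String attempted(String s) {
        return s.replaceAll("&[ ]|[^(?*^( ));]", "&amp;");
    }

    public static void main(String[] args) {
        // Every ordinary character matches the negated class and is replaced on its own.
        System.out.println(attempted("ab"));
        // Order and duplicates inside a class do not matter, and a class matches one character.
        System.out.println("b".matches("[abc]") + " " + "b".matches("[ccbbaa]") + " " + "ab".matches("[abc]"));
    }
}
```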
Hopefully that helps you understand what was going wrong?
As for actually solving your problem...
Solution
To match an ampersand that does not start a HTML entity, you can use:
&(?![a-z][a-z0-9]+;|#(?:\d+|x[\dA-F]+);)
That is, an ampersand, followed by a negative lookahead for either of:
a letter, then one or more letters or numbers, then a semicolon - i.e. a named entity reference
a hash, then either a number, or an x followed by a hex number, and finally a semicolon - i.e. a numeric entity reference.
To use this in CFML, to replace & with &amp; would be:
<cfset data = rereplaceNoCase( data , '&(?![a-z][a-z0-9]+;|##(?:\d+|x[\dA-F]+);)' , '&amp;' , 'all' ) />
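For anyone who wants to try the pattern outside ColdFusion, here is a minimal Java sketch of the same negative-lookahead replacement. The class name, method name and sample strings are invented for the example; CASE_INSENSITIVE stands in for rereplaceNoCase.

```java
import java.util.regex.Pattern;

class EscapeBareAmpersandSketch {
    // An ampersand NOT followed by a named entity (e.g. Uuml;) or a numeric one (#228; / #xE4;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?![a-z][a-z0-9]+;|#(?:\\d+|x[\\dA-F]+);)",
            Pattern.CASE_INSENSITIVE);

    static String escape(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("&Uuml;bung 1: &Uuml; & &Auml; &#228; &#xE4; R&D"));
    }
}
```

Because an existing &amp; is itself a named entity reference, running the replacement a second time changes nothing.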
I think it would be easier to simply replace all occurrences of & with &amp;, and then restore the wrongly replaced ones:
<cfset data = ReReplace(ReReplace(data, "&", "&amp;", "ALL"), "&amp;([^;&]*;)", "&\1", "ALL") />
I haven't tested this in ColdFusion (since I have no clue how to), but it should work, because in JavaScript, the regex itself works:
var s = "I we&nt out on 1 se&123;p 2012 and& it was be&tter & than 15 jan 2012"
console.log(s.replace(/&/g, '&amp;').replace(/&amp;([^;&]*;)/g, '&$1'));
//"I we&amp;nt out on 1 se&123;p 2012 and&amp; it was be&amp;tter &amp; than 15 jan 2012"
So I assume the regex will also do its trick in CF.
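The same two-pass idea, sketched in Java for comparison (class and method names are hypothetical). It also shows a case where the second pass restores more than intended, since "anything up to the next semicolon" is a looser test than "an entity".

```java
class TwoPassAmpersandSketch {
    static String escape(String s) {
        return s.replaceAll("&", "&amp;")                   // pass 1: escape every ampersand
                .replaceAll("&amp;([^;&]*;)", "&$1");       // pass 2: undo it where a ; follows
    }

    public static void main(String[] args) {
        System.out.println(escape("&Uuml;bung 1: &Uuml; & &Auml;"));
        // The bare & here is followed by " Jerry;" and is therefore un-escaped again.
        System.out.println(escape("Tom & Jerry; done"));
    }
}
```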
The other option you have is to not use REGEX at all. For the sample string you listed, you are simply trying to replace the HTML ampersand ("&"), without affecting the HTML entities.
This can be accomplished just using REPLACE.
Remember that when using entities, there will be no spaces around the ampersand character, whereas an ampersand character that needs converting to an HTML entity typically has a leading and trailing space. REPLACE will find every case of " & " and update it, without affecting any of the "&Uuml;" strings (which have no leading and trailing space).
<cfset html = "&Uuml;bung 1: &Uuml; & &Auml;">
<cfset parsedHtml = REPLACE(html," & ", " &amp; ","All")>
For performance and to stay issue-free, just go with the decimal code point, like so:
<cfset html = Replace(html, Chr(38), "&amp;", "all")>
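The trade-offs of the two plain replacements can be seen in a short Java sketch (String.replace does literal, replace-all substitution; the class and method names are made up here).

```java
class PlainReplaceSketch {
    // Only ampersands that have a space on both sides.
    static String spacedOnly(String s) {
        return s.replace(" & ", " &amp; ");
    }

    // Every ampersand, i.e. every Chr(38), including those that start an entity.
    static String every(String s) {
        return s.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(spacedOnly("&Uuml;bung 1: &Uuml; & &Auml;"));
        System.out.println(spacedOnly("R&D"));
        System.out.println(every("&Uuml; & &Auml;"));
    }
}
```

The spaced version leaves an unspaced ampersand such as the one in "R&D" alone, while replacing every ampersand also rewrites the ones that belong to existing entities, so which one fits depends on the input.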

how to trim a string without any spaces

How do I remove spaces and other whitespace characters from inside a string. I don't want to remove the space just from the ends of the string, but throughout the entire string.
You can use a regular expression
<cfset str = reReplace(str, "[[:space:]]", "", "ALL") />
You can also simply use ColdFusion's Replace() (if you don't want to use regular expressions for some reason), but don't forget the optional "ALL" parameter.
I've run into this in the past, trying to remove 5 spaces in the middle of a string - I would do something like:
<cfset str = Replace(str, " ", "")/>
Forgetting the "ALL" will only replace the first occurrence so I would end up with 4 spaces, if that makes sense.
Be sure to use:
<cfset str = Replace(str, " ", "", "ALL")/>
to replace multiple spaces. Hope this helps!
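The difference between the approaches in this thread, sketched in Java with invented names. Java's regex engine does not use the [[:space:]] spelling, so \s is used for "any whitespace"; replaceFirst plays the role of Replace() without "ALL".

```java
class StripWhitespaceSketch {
    // Regex: removes spaces, tabs, newlines and other whitespace.
    static String removeAllWhitespace(String s) {
        return s.replaceAll("\\s", "");
    }

    // Literal replace: removes space characters only.
    static String removeSpacesOnly(String s) {
        return s.replace(" ", "");
    }

    // Like Replace() without "ALL": only the first space goes.
    static String removeFirstSpace(String s) {
        return s.replaceFirst(" ", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAllWhitespace("a b\tc\nd"));
        System.out.println(removeSpacesOnly("a b\tc\nd").length());
        System.out.println(removeFirstSpace("a     b").length());
    }
}
```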