trying rereplace and replace to modify some data - regex

I am suffering from regex illness, i am taking medicines but nothing happening, now i am stuck again with this issue
<cfset Change = replacenocase(mytext,'switch(cSelected) {',' var x = 0;while(x < cSelected.length){switch(cSelected[x]) {','one')>
this did not changed anything
i tried Rereplace too
<cfset Change = rereplacenocase(mytext,'[switch(cSelected) {]+',' var x = 0;while(x < cSelected.length){switch(cSelected[x]) {','one')>
this created weird results

Parentheses, square brackets, and curly brackets are special characters in any implementation of RegEx. Wrapping something in [square brackets] means any of the characters within so [fifty] would match any of f,i,t,y. The plus sign after it just means to match any of these characters as many times as possible. So yes [switch(cSelected) {]+ would replace switch(cSelected) {, but it would also replace any occurrence of switch, or s, or w, or the words this or twitch() because each character in these is represented in your character class.
As a regex, you would instead want (switch\(cSelected\) \{) (the + isn't useful here, and we have to escape the parentheses that we want literally represented. It is also a good idea to escape curly braces because they have special meaning in parts of regex and I believe that when you're new to regex, there's no such thing as over-escaping.
(switch # Opens Capture Group
# Literal switch
\(cSelected # Literal (
# Literal cSelected
\) # Literal )
# single space
\{ # Literal {
) # Closes Capture Group
You can also try something like (switch\(cSelected\)\s*\{), using the token \s* to represent any number of whitespace characters.
(switch # Opens CG1
# Literal switch
\(cSelected # Literal (
# Literal cSelected
\) # Literal )
\s* # Token: \s for white space
# * repeats zero or more times
\{ # Literal {
) # Closes CG1
What's needed, and the reason people can't be of much assistance is an excerpt from what you're trying to modify and more lines of code.
Potential reasons that the non-regex ReplaceNoCase() isn't working is either that it can't make the match it needs, which could be a whitespace issue, or it could be that you have two variables setting Change to an action based on the mytext variable..

Related

Regular expression for removing white space and comments from free-spacing and comments mode regular expressions

I'm writing a PCRE regular expression for the purpose of 'minifying' other PCRE regular expressions written in free-spacing and comments mode (/x flag), such as:
# Match a 20th or 21st century date in yyyy-mm-dd format
(19|20)\d\d # year (group 1)
[- /.] # separator - dash, space, slash or period
(0[1-9]|1[012]) # month (group 2)
[- /.] # separator - dash, space, slash or period
(0[1-9]|[12][0-9]|3[01]) # day (group 3)
Note: I've intentionally omitted any regular expression delimiters and x flag
The result of 'minifying' the above expression should be that all literal whitespace characters (including new lines) and comments are removed, except literal spaces within a character class (e.g. [- /.]) and escape whitespace characters (e.g. \):
(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
This is the regular expression I have so far, itself written in free-spacing and comments mode (https://regex101.com/r/RHnyWw/2/):
(?<!\\)\s # Match any non-escaped whitespace character
|
(?<!\\)\#.*\s*$ # Match comments (any text following non-escaped #)
Assuming I substitute all matches with empty string, the result is:
(19|20)\d\d[-/.](0[1-9]|1[012])[-/.](0[1-9]|[12][0-9]|3[01])
This is close, except that the space characters with the separator [- /.] parts of the pattern have lost the literal space.
How can I change this pattern so that literal space (and #) characters with [ and ] are preserved?
May be this regex can help
(?:\[(?:[^\\\]]++|\\.)*+\]|\\.)(*SKIP)(*F)|\#.*?$|\s++
Here's my solution:
# Match any literal whitespace character, except when within a valid character class
# at first position, or second position after `-`
(?<!\\|(?<!\\)\[|(?<!\\)\[-)\s
|
# Match comments (any text following a literal # until end-of-line), except when
# within a character class at first position, or second position after `-` or third
# position after `- `
(?<!\\|(?<!\\)\[|(?<!\\)\[-|(?<!\\)\[\ |(?<!\\)\[-\ )\#.*$\r?\n?
The results of of minifying itself are:
(?<!\\|(?<!\\)\[|(?<!\\)\[-)\s|(?<!\\|(?<!\\)\[|(?<!\\)\[-|(?<!\\)\[\ |(?<!\\)\[-\ )\#.*$\r?\n?
https://regex101.com/r/3EVpuH/1
An advantage of this solution is that it doesn't depend on backtracking control verbs (which I'd not heard of until I looked into after seeing Michail's solution).
The disadvantage (over Michail's solution) is that if you want to specify a dash, space and/or # characters within a character class, they must appear in a specific order: dash, space then hash i.e. [- #]. I don't know is this requirement can be eliminated without using control verbs.

Make sure the presence of a particular character in a string

I'm using the following regex to parse my application log file to search for particular string
\[\s*\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200}))*(\.?|\b)\s*]
This works fine but now we need to make sure that the string "must" contain "-" character to match. I'm confused to add this condition to the original regx.
Any pointers would be helpful.
Thanks and Regards,
Santhosh
The regex matches a string inside square brackets, [ and ], and may only consist of non-[ and non-] symbols.
You can easily add a positive lookahead restriction after the opening [ like check if the next characters other than ] and [ are followed with -:
\[ # opening [
(?=[^\]\[]*-) # There must be a hyphen in [...]
\s* # 0+ whitespaces
\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200}) # Part 1 (with obligatory subpattern)
(?:\. # Part 2, optional
(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200})
)*
(\.?|\b) # optional . or word boundary
\s* # 0+ whitespaces
] # closing ]
See the regex demo
And a one-liner:
\[(?=[^\]\[]*-)\s*\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200}))*(\.?|\b)\s*]
Tip: use the verbose /x modifier to split the pattern into separate multiline blocks for analysis, it will help you in the future when you need to modify the pattern again.
If you need to match only if - or # is present inside [...], modify the lookahead as (?=[^\]\[]*[-#]). For a more general case, use (?=[^\]\[]*(?:one|another|must-be-present)) alternatives inside an additional group inside the lookahead.
Updated answer - Assertion
In this case, the better way to do it is to use an assertion consisting of checking only the position's expected to match the character in question.
I know it's simple, but using the outter pseudo-anchor text \[ ... \] as
a delimiter that cannot exist in the body is a rarity.
You should always try to avoid doing it like this.
Things change, your input could change.
The rule to follow in validation of known characters that are Mid-String is to use only them
when using an assertion validator.
This avoids the necessity of relying on what is not there at the moment ie, not a ],
but should rely on what is there.
Again, this pertains to mid-string matching.
BOL/EOL is a different thing entirely ^$, and is a more permanent construct
with which to leverage.
It's always better to code smarter.
\[\s*\b(?=[0-9A-Za-z][0-9A-Za-z_.#]{0,199}-|[0-9A-Za-z][0-9A-Za-z_.#]{0,200}(?:\.[0-9*A-Za-z][0-9A-Za-z_.#]{0,200})*\.[0-9*A-Za-z][0-9A-Za-z_.#]{0,199}-)(?:[0-9A-Za-z][0-9A-Za-z_.#-]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z_.#-]{0,200}))*(\.?|\b)\s*\]
Using Conditionals
If your engine supports conditionals, the easy way is to not rely on a fluke
of pseudo anchor text, ie. [..].
\[\s*\b[0-9A-Za-z](?:[0-9A-Za-z_.#]|(-)){0,200}(?:\.(?:[0-9*A-Za-z](?:[0-9A-Za-z_.#]|(-)){0,200}))*(\.?|\b)\s*\](?(1)|(?(2)|(?!)))
Expanded
\[ \s* \b
[0-9A-Za-z]
(?:
[0-9A-Za-z_.#]
| ( - ) # (1)
){0,200}
(?:
\.
(?:
[0-9*A-Za-z]
(?:
[0-9A-Za-z_.#]
| ( - ) # (2)
){0,200}
)
)*
( \.? | \b ) # (3)
\s* \]
(?(1) # Fail if no dash found
| (?(2)
| (?!)
)
)
This conditional would work if just want to make sure that - occurs within your string before running your block of code.
if (myString.indexOf('-') >= 0) {
//your code
}
If you have to have a single hyphen, you'll have to either repeat most of the pattern, or check for it in a second phase:
if re.match(pattern, line):
if not '-' in line:
raise MissingDash('No dash in line: {}'.format(line))
I'd suggest adding the second check, since adding the requirement to the regex would make it even more horrible to read.

How does this regex for FQDNs (excluding.arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with something that isn't one of the letter apr.' - it's a negated character class.
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Your 'positive/negative' lookaheads are to match a piece of a pattern that aren't surrounded by other pieces of pattern. But that can have some unexpected outcomes if you're matching variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w\.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead though - you'll capture.... www, but you won't match test (with e.g. g flag, because the www doesn't have .arpa after it, but the test does.
https://regex101.com/r/hU6tP0/5
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To avoid the tail of the string to be .arpa you need to use a negative lookahead (?!...), so modify just like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've upgraded the regex to rationalise it (i've incorporated also the Sobrique suggestion adding an important details):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legenda
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
(?: # Non capturing group 1 (NCG1)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE NCG1 - repeat its content one or more time
(?!arpa$) # force that after the last literal dot '.' the string does not end with 'arpa' (i've added '$' to Sobrique suggestion instead it prevents also '.arpanet' too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # Close the regex delimiter and add case insensitive flag [a-z] match also [A-Z] and viceversa
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var m;
while(t = tests.pop()) {
document.getElementById("r").innerHTML += '"' + t + '"<br/>';
document.getElementById("r").innerHTML += 'Valid domain? ' + ( (t.match(re)) ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"/>

RegEx to match some wrapped texts

Consider following text:
aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf
I want to retrive text between parenthesis, but adjacent parentheses should be consider a single unit. How can I do that?
For above example desired output is:
( I)
(as)(dfdsf)(adf)
(sfg).(dfdf)
(asd #54 54 !fa.)
Assumption
No nesting (), and no escaping of ()
Parentheses are chained together with the . character, or by being right next to each other (no flexible spacing allowed).
(a)(b).(c) is consider a single token (the . is optional).
Solution
The regex below is to be used with global matching (match all) function.
\([^)]*\)(?:\.?\([^)]*\))*
Please add the delimiter on your own.
DEMO
Explanation
Break down of the regex (spacing is insignificant). After and including # are comments and not part of the regex.
\( # Literal (
[^)]* # Match 0 or more characters that are not )
\) # Literal ). These first 3 lines match an instance of wrapped text
(?: # Non-capturing group
\.? # Optional literal .
\([^)]*\) # Match another instance of wrapped text
)* # The whole group is repeated 0 or more times
I'd go with: /(?:\(\w+\)(?:\.(?=\())?)+/g
\(\w+\) to match a-zA-Z0-9_ inside literal braces
(?:\.(?=\())? to capture a literal . only if it's followed by another opening brace
The whole thing wrapped in (?:)+ to join adjacent captures together
var str = "aas(I)f df (asdfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asdfa) sdf";
str.match(/(?:\(\w+\)(?:\.(?=\())?)+/g);
// -> ["(I)", "(asdfdsf)(adf)", "(sfg).(dfdf)", "(asdfa)"]
try [^(](\([^()]+([)](^[[:alnum:]]*)?[(][^()]+)*\))[^)]. capture group 1 is what you want.
this expression assumes that every kind of character apart from parentheses mayy occur in the text between parentheses and it won't match portions with nested parentheses.
This one should do the trick:
\([A-Za-z0-9]+\)

regex to match RTRIM(LTRIM(xx)) = xx

I am trying to jot down regex to find where I am using ltrim rtrim in where clause in stored procedures.
the regex should match stuff like:
RTRIM(LTRIM(PGM_TYPE_CD))= 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))='P'))
RTRIM(LTRIM(PGM_TYPE_CD)) = 'P'))
RTRIM(LTRIM(PGM_TYPE_CD))= P
RTRIM(LTRIM(PGM_TYPE_CD))= somethingelse))
etc...
I am trying something like...
.TRIM.*\)\s+
[RL]TRIM\s*\( Will look for R or L followed by TRIM, any number of whitespace, and then a (
This what you want:
[LR]TRIM\([RL]TRIM\([^)]+\)\)\s*=\s*[^)]+\)*
?
What's that doing is saying:
[LR] # Match single char, either "L" or "R"
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[RL] # Match single char, either "R" or "L" (same as [LR], but easier to see intent)
TRIM # Match text "TRIM"
\( # Match an open parenthesis
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)\) # Match two closing parentheses
\s* # Zero or more whitespace characters
= # Match "="
\s* # Again, optional whitespace (not req unless next bit is captured)
[^)]+ # Match one or more of anything that isn't closing parenthesis
\)* # Match zero or more closing parentheses.
If this is automated and you want to know which variables are in it, you can wrap parentheses around the relevant parts:
[LR]TRIM\([RL]TRIM\(([^)]+)\)\)\s*=\s*([^)]+)\)*
Which will give you the first and second variables in groups 1 and 2 (either \1 and \2 or $1 and $2 depending on regex used).
How about something like this:
.*[RL]TRIM\s*\(\s*[RL]TRIM\s*\([^\)]*)\)\s*\)\s*=\s*(.*)
This will capture the inside of the trim and the right side of the = in groups 1 and 2, and should handle all whitespace in all relevant areas.