Exact string coldfusion regular expression - regex

I am using a regular expression to replace all characters that are not equal to the exact word "NULL" and also keep all digits. I did a first step, by replacing all "NULL" words from my string with this :
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","\bNULL\b","","ALL")>
It removes all instances of the exact word "NULL", which means it does not remove the letters "N", "U" and "L" from the substring "123NjyfjUghfLL". This is correct. But now I want to reverse that: I want to keep only the word "NULL", meaning the stray "N", "U" and "L" letters should be removed. So I tried this:
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","[^\bNULL\b]","","ALL")>
But now this keeps all "N", "U" and "L" letters, so it outputs "NULLNULLNULLNULL". There should be only three occurrences of "NULL".
Can someone help me with this please? And where should I add the extra code to keep digits? Thank you.

You can do this
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)","\1","ALL")>
(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)
Explanation:
( # Opens Capture Group 1
^ # Anchors to the beginning of the string.
| # Alternation (CG1)
\| # Literal |
) # Closes CG1
(?! # Opens Negative Lookahead
NULL # Literal NULL
(?: # Opens Non-Capturing group
$ # Anchors to the end of the string.
| # Alternation (NCG)
\| # Literal |
) # Closes NCG
) # Closes NLA
( # Opens Capture Group 2
[^|]* # Negated Character class (excludes the characters within)
# None of: |
# * repeats zero or more times
) # Closes CG2
(?= # Opens LA
$ # Anchors to the end of the string.
| # Alternation (LA)
\| # Literal |
) # Closes LA
Regex101.com demo
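For readers who are not on ColdFusion, here is a minimal sketch of the same replacement in Python. The function name strip_non_null is hypothetical, and the sketch assumes Python's re module handles this pattern the same way for the sample input.

```python
import re

# The answer's pattern: a field is removed unless it is exactly NULL.
PATTERN = re.compile(r"(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)", re.IGNORECASE)

def strip_non_null(data):
    # \1 puts back the delimiter (or nothing at the start of the string).
    return PATTERN.sub(r"\1", data)

print(strip_non_null("123NjyfjUghfLL|NULL|NULL|NULL"))  # |NULL|NULL|NULL
```

Note that the delimiters are kept, so the removed first field leaves a leading |.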
Lastly, some insight about character classes (content between square brackets)
What [^\bNULL\b] means is
[^\bNULL\b] # Negated Character class (excludes the characters within)
# None of: \b,N,U,L
# When \b is inside a character class, it matches a backspace character.
# Outside of a character class, \b matches a word boundary as you use it in your first code.
Character classes are not designed for matching or ignoring words, they're designed for permitting or excluding characters or ranges of characters.
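The character-class point can be checked with a small Python sketch (Python's re also treats \b inside square brackets as a backspace). The variable names are only illustrative.

```python
import re

s = "123NjyfjUghfLL|NULL|NULL|NULL"

# Inside [...], \b is a backspace character, so both classes below exclude
# the same four characters: backspace, N, U and L.
as_written = re.sub(r"[^\bNULL\b]", "", s, flags=re.IGNORECASE)
spelled_out = re.sub(r"[^\x08NUL]", "", s, flags=re.IGNORECASE)

print(as_written)                 # NULLNULLNULLNULL
print(as_written == spelled_out)  # True
```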
Edit:
OK, so it works well. But what if I would also like to keep the digits? I am kind of lost in this line of code and cannot find where to put the extra code. I think the extra code would be [^0-9], right?
This regex (demo) works to also permit numbers of any length where the number is the entire value
(^|\|)(?!(?:NULL|[0-9]+)(?:$|\|))([^|]*)(?=$|\|)
You can also use this regex (demo) to permit numbers with a decimal value.
(^|\|)(?!(?:NULL|[0-9]+(?:\.[0-9]+)?)(?:$|\|))([^|]*)(?=$|\|)
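To compare the two variants, here is a hedged Python sketch. The sample string and the constant names are made up for illustration; only the two patterns come from the answer.

```python
import re

KEEP_NULL_OR_INT = re.compile(
    r"(^|\|)(?!(?:NULL|[0-9]+)(?:$|\|))([^|]*)(?=$|\|)", re.IGNORECASE
)
KEEP_NULL_OR_DECIMAL = re.compile(
    r"(^|\|)(?!(?:NULL|[0-9]+(?:\.[0-9]+)?)(?:$|\|))([^|]*)(?=$|\|)", re.IGNORECASE
)

sample = "123|abc|NULL|4x|7.5"  # hypothetical input
int_result = KEEP_NULL_OR_INT.sub(r"\1", sample)
decimal_result = KEEP_NULL_OR_DECIMAL.sub(r"\1", sample)

print(int_result)      # 123||NULL||
print(decimal_result)  # 123||NULL||7.5
```

Only fields that are entirely NULL or entirely a number survive; "4x" is removed by both patterns, and "7.5" survives only the decimal one.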

Related

How to do text wrapping without adding newline if the residue is short?

Description
Say I have a lot of strings, some of them are very long:
Aim for the moon. If you miss, you may hit a star. – Clement Stone
Nothing about us without us
I want to have a text wrapper doing this algorithm:
1. Starting from the beginning of the string, identify the blank character ( ) nearest to position 25
2. If the residue is shorter than 5 characters, do nothing. If not, replace that blank character with \n
3. Identify the next blank character nearest to the end of the following 25 characters
4. Return to step 2 until the end of the line
So that text will be replaced to:
Aim for the moon. If you\nmiss, you may hit a star.\n– Clement Stone
Nothing about us without us
Attempt 1
Consulting Wrapping Text With Regular Expressions
Matching pattern: (.{1,25})( +|$\n?)
Replacing pattern: $1\n
But this will produce Nothing about us without\nus, which is not preferable.
Attempt 2
Using a Lookahead Construct in a If-Then-Else Conditionals:
Matching pattern: (.{1,25})(?(?=(.{1,5}$).*))( +|$\n?)
Replacing pattern: $1$2\n
It still produces Nothing about us without\nus, which is not preferable.
I created this based on @sln's answer to a different word wrap problem.
All I have added is this alternative point to add a line break:
"Expand by up to 5 characters until before a linebreak or EOS"
and changed the number of characters allowed from 50 to 25
[^\r\n]{1,5}(?=\r?\n|$)
Compressed
(?:((?>.{1,25}(?:[^\r\n]{1,5}(?=\r?\n|$)|(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,25})(?:\r?\n)?|(?:\r?\n|$))
Replacement
$1 followed by a linebreak
$1\r\n
Preview
https://regex101.com/r/pRqdhi/1
Detailed Regular Expression
(?:
# -- Words/Characters
( # (1 start)
(?> # Atomic Group - Match words with valid breaks
.{1,25} # 1-N characters
# Followed by one of 4 prioritized, non-linebreak whitespace
(?: # break types:
[^\r\n]{1,5}(?=\r?\n|$) # Expand by up to 5 characters until before a linebreak or EOS
|
(?<= [^\S\r\n] ) # 1. - Behind a non-linebreak whitespace
[^\S\r\n]? # ( optionally accept an extra non-linebreak whitespace )
| (?= \r? \n ) # 2. - Ahead a linebreak
| $ # 3. - EOS
| [^\S\r\n] # 4. - Accept an extra non-linebreak whitespace
)
) # End atomic group
|
.{1,25} # No valid word breaks, just break on the N'th character
) # (1 end)
(?: \r? \n )? # Optional linebreak after Words/Characters
|
# -- Or, Linebreak
(?: \r? \n | $ ) # Stand alone linebreak or at EOS
)
If your input is run line-by-line, and there is no newline character in the middle of a line, then you can try this:
Pattern: (.{1,25}.{1,5}$|.{1,25}(?= ))
Substitution: $1\n
Then apply this to remove the space left at the start of each new line:
Pattern: \n[ ] (a newline followed by one space)
Substitution: \n
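The two substitutions can be sketched in Python as follows. This assumes the second step is meant to drop the space that the lookahead leaves at the start of each new line. The function name wrap and the final rstrip are additions of this sketch, not part of the answer.

```python
import re

# Step 1: break after a chunk of up to 25 characters that is followed by a
# space, unless the rest of the line fits in 25 + 5 characters.
BREAK = re.compile(r"(.{1,25}.{1,5}$|.{1,25}(?= ))")

def wrap(line):
    broken = BREAK.sub(r"\1\n", line)
    # Step 2: drop the space that step 1 leaves at the start of each new line.
    broken = broken.replace("\n ", "\n")
    # Assumption of this sketch: the newline appended after the last chunk
    # is not wanted, so it is stripped here.
    return broken.rstrip("\n")

print(wrap("Nothing about us without us"))  # Nothing about us without us
```

Because the leftover space counts toward the next 25 characters, the break positions can differ slightly from the ones in the question's expected output.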

Regex pattern without one case

I would like to remove some strings from filename.
I want to remove every string in bracket but not if there is a string "remix" or "Remix" or "REMIX"
Now I have got
sed "s/\s*\(\s?[A-z0-9. ]*\)//g"
but how to exclude cases when there is remix in string?
You can use a capture group:
sed 's/\(\s*([^)]*remix[^)]*)\)\|\s*(\s\?[a-z0-9. ]*)/\1/gi'
When the "remix branch" doesn't match, the capture group is not defined and the matched part is replaced with an empty string.
When the "remix branch" succeeds, the matched part is replaced by the content of the capture group, so by itself.
Note: if that helps to avoid false positive, you can add word-boundaries around "remix": \bremix\b
pattern details:
\( # open the capture group 1
\s* # zero or more white-spaces
( # a literal parenthesis
[^)]* # zero or more characters that are not a closing parenthesis
remix
[^)]*
)
\) # close the capture group 1
\| # OR
# something else between parenthesis
\s* # note that it is essential that the two branches are able to
# start at the same position. If you remove \s* in the first
# branch, the second branch will always win when there's a space
# before the opening parenthesis.
(\s\?[a-z0-9. ]*)
\1 is the reference to the capture group 1
i makes the pattern case-insensitive
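The same capture-group trick can be sketched in Python, where an unmatched group also expands to an empty string in the replacement (Python 3.5 and later). The function name clean and the sample file names are hypothetical.

```python
import re

PATTERN = re.compile(
    r"(\s*\([^)]*remix[^)]*\))"   # group 1: a "remix" parenthesis, kept
    r"|\s*\(\s?[a-z0-9. ]*\)",    # any other parenthesis, dropped
    re.IGNORECASE,
)

def clean(name):
    # When the second branch matches, group 1 is unset and \1 is empty.
    return PATTERN.sub(r"\1", name)

print(clean("Song (Original Mix) (Club Remix).mp3"))  # Song (Club Remix).mp3
```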
[EDIT]
If you want to do it in a POSIX-compliant way, you must use a different approach because several GNU features are not available, in particular the alternation \| (but also the i modifier, the \s character class, and the optional quantifier \?).
This other approach consists of matching any characters that are not an opening parenthesis, plus any parenthesized substrings containing "remix", followed by optional white-space and an optional parenthesized substring.
As you can see, everything is optional and the pattern can match an empty string, but that isn't a problem.
Everything before the parenthesized part to remove is captured in group 1.
sed 's/\(\([^(]*([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)[^ \t(]*\([ \t]\{1,\}[^ \t(]\{1,\}\)*\)*\)\([ \t]*([^)]*)\)\{0,1\}/\1/g;'
pattern details:
\( # open the capture group 1
\(
[^(]* # all that is not an opening parenthesis
# substring enclosed between parentheses with "remix" inside
( [^)]* [Rr][Ee][Mm][Ii][Xx] [^)]* )
# Let's reach the next parenthesis without matching the white-spaces
# before it (otherwise the leading white-spaces are not removed)
[^ \t(]* # all that is not a white-space or an opening parenthesis
# optional groups of white-spaces followed by characters that are
# neither white-spaces nor opening parentheses
\( [ \t]\{1,\} [^ \t(]\{1,\} \)*
\)*
\) # close the capture group 1
\(
[ \t]* # leading white-spaces
([^)]*) # parenthesis
\)\{0,1\} # makes this part optional (this avoids removing a "remix" part
# alone at the end of the string)
Word boundaries aren't available in this mode either, so the only way to emulate them is to list the four possibilities:
([Rr][Ee][Mm][Ii][Xx]) # poss1
([Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss2
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx]) # poss3
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss4
and to replace ([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*) with:
\(poss1\)\{0,\}\(poss2\)\{0,\}\(poss3\)\{0,\}\(poss4\)\{0,\}
Just skip the lines matching "remix":
sed '/([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)/! s/([^)]*)//g'
If the brackets are square brackets (US usage): []
sed '/remix\|REMIX\|Remix/ !s/\[[^]]*]//g'
If the brackets are parentheses (usage in the rest of the world): ()
sed '/remix\|REMIX\|Remix/ !s/([^)]*)//g'
assuming:
- there are no nested brackets
- other spellings of remix (ReMix, ...) are not recognized, so the brackets on such a line are still removed
- "remix" may appear anywhere in the title (i love remix) [if needed, specify which occurrences should count]
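The "skip the lines that match" idea can be sketched in Python like this. It follows the parenthesis variant and only counts "remix" when it sits inside parentheses. The names are illustrative.

```python
import re

REMIX_IN_PARENS = re.compile(r"\([^)]*remix[^)]*\)", re.IGNORECASE)
ANY_PARENS = re.compile(r"\([^)]*\)")

def clean_line(line):
    # Like sed's /pattern/! address: lines with a "remix" parenthesis are
    # left untouched, other lines lose every parenthesized part.
    if REMIX_IN_PARENS.search(line):
        return line
    return ANY_PARENS.sub("", line)

print(clean_line("Song(Live).mp3"))  # Song.mp3
```

As with the sed one-liners, a line that contains a remix parenthesis keeps all of its other parentheses too.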

RegEx to replace prefix and postfix

I would like to build a regex to replace the prefix and postfix of a string. The general string is built from:
a known prefix string
some letter a-z or A-Z
some unknown string with letters, hyphens, backslash, slash and numbers.
a hyphen
an integer number
the symbols #.
some string of letters
Examples:
KnownStringr/df-2e\d-3724#.Gkjsu
KnownStringEd\e4v-bn-824#.YKfg
KnownStringa-YK224E\yy-379924#.awws
I would like to replace the prefix and postfix of the NUMBER so that I get:
MyPrefix3724MyPostfix
MyPrefix824MyPostfix
MyPrefix379924MyPostfix
This regex should do the trick, but you should always specify the language/framework you're using, because not all regex engines support the same features.
The number that you want to capture is in capture group 3 ((\d+)), which most languages reference as \3 or $3 in the replacement.
(?:KnownString)([a-zA-Z])(.*?)-(\d+)\#\.[a-zA-Z]+
Explanation:
(?: # Opens NCG
KnownString # Literal KnownString
) # Closes NCG
( # Opens CG1
[a-zA-Z] # Character class (any of the characters within)
# Anything between a and z
# Anything between A and Z
) # Closes CG1
( # Opens CG2
.*? # . denotes any single character, except for newline
# * repeats zero or more times
# ? as few times as possible
)- # Closes CG2
# Literal -
( # Opens CG3
\d+ # Token: \d (digit)
# + repeats one or more times
) # Closes CG3
\# # Literal #
\. # Literal .
[a-zA-Z]+ # Character class (any of the characters within)
# Anything between a and z
# Anything between A and Z
# + repeats one or more times
You haven't specified what the known prefix is. Be careful to escape special characters in the known string, especially period, plus sign, asterisk, question mark, and parentheses.
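Since the language was not specified, here is a minimal Python sketch of the replacement. The function name rewrap and its parameters are hypothetical. re.escape takes care of special characters in the known prefix, and because the unused groups are dropped, the number ends up in group 1 here.

```python
import re

def rewrap(text, known_prefix, new_prefix="MyPrefix", new_postfix="MyPostfix"):
    # Same shape as the regex above, with the known prefix escaped.
    pattern = re.compile(
        re.escape(known_prefix) + r"[a-zA-Z].*?-(\d+)#\.[a-zA-Z]+"
    )
    # A function replacement avoids any backslash handling in the new strings.
    return pattern.sub(lambda m: new_prefix + m.group(1) + new_postfix, text)

print(rewrap(r"KnownStringr/df-2e\d-3724#.Gkjsu", "KnownString"))  # MyPrefix3724MyPostfix
```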

use of colon symbol in regular expression

I am new to regex. I am studying it in regularexperssion.com. The question is that I need to know what is the use of a colon (:) in regular expressions.
For example:
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
which matches:
$url1 = "http://www.somewebsite.com";
$url2 = "https://www.somewebsite.com";
$url3 = "https://somewebsite.com";
$url4 = "www.somewebsite.com";
$url5 = "somewebsite.com";
Yeah, any help would be greatly appreciated.
Colon : is simply colon. It means nothing, except special cases like, for example, clustering without capturing (also known as a non-capturing group):
(?:pattern)
Also it can be used in character classes, for example:
[[:upper:]]
However, in your case colon is just a colon.
Special characters used in your regex:
In character class [-+_~.\d\w]:
- means -
+ means +
_ means _
~ means ~
. means .
\d means any digit
\w means any word character
These symbols have this meaning because they are used inside a character class [].
Outside a character class, + and . have a special meaning.
Other elements:
=? means = that can occur 0 or 1 times; in other words = that can occur or not, optional =.
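A quick way to convince yourself is a small Python check (the class behaves the same way there as in PHP for these characters). The variable names are only for illustration.

```python
import re

CLASS = re.compile(r"[-+_~.\d\w]")

# Inside the class, + and . only match themselves; the colon is not in it.
inside = [c for c in "-+_~.7a:" if CLASS.fullmatch(c)]
print(inside)  # ['-', '+', '_', '~', '.', '7', 'a']

# Outside a class, . matches any character and + is a quantifier.
any_chars = bool(re.fullmatch(r".+", "a:b"))
literal_only = bool(re.fullmatch(r"[.+]+", "a:b"))
print(any_chars, literal_only)  # True False
```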
I've decided to go you one better and explain the entire regex:
^ # anchor to start of line
( # start grouping
( # start grouping
[\w]+ # at least one of 0-9a-zA-Z_
: # a literal colon
) # end grouping
? # this grouping is optional
\/\/ # two literal slashes
) # end capture
? # this grouping is optional
(
(
[\d\w] # exactly one of 0-9a-zA-Z_
# having \d is redundant
| # alternation
% # literal % sign
[a-fA-f\d]{2,2} # exactly 2 hexadecimal digits
# should probably be A-F
# using {2} would have sufficed
)+ # at least one of these groups
( # start grouping
: # literal colon
(
[\d\w]
|
%
[a-fA-f\d]{2,2}
)+
)? # Same grouping, but it is optional
# and there can be only one
# # literal # sign
)? # this group is optional
(
[\d\w] # same as [\w], explained above
[-\d\w]{0,253} # includes a dash (-) as a valid character
# between 0 and 253 of these characters
[\d\w] # end with \w. They want at most 255
# total and - cannot be at the start
# or end
\. # literal period
)+ # at least one of these groups
[\w]{2,4} # two to four \w characters
(
: # literal colon
[\d]+ # at least one digit
)?
(
\/ # literal slash
(
[-+_~.\d\w] # one of these characters
| # *or*
% # % with two hex digit combo
[a-fA-f\d]{2,2}
)* # zero or more of these groups
)* # zero or more of these groups
(
\? # literal question mark
(
&? # optional literal &
(
[-+_~.\d\w]
|
%
[a-fA-f\d]{2,2}
)
=? # optional literal =
)* # zero or more of this group
)? # this group is optional
(
# # literal #
(
[-+_~.\d\w]
|
%
[a-fA-f\d]{2,2}
)*
)?
$ # anchor to end of line
It's important to understand what the metacharacters/sequences are. Some sequences are not meta when used in certain contexts (especially a character class). I've cataloged them for you:
meta with no context
^ -- zero width start of line
() -- grouping/capture
? -- zero or one of the preceding sequence
+ -- one or more of the preceding sequence
* -- zero or more of the preceding sequence
[] -- character class
\w -- alphanumeric characters and _. Opposite of \W
| -- alternation
{} -- length assertion
$ -- zero width end of line
This excludes :, #, and % from having any special/meta meaning in the raw context.
meta inside character class
] ends the character class. - creates a range of characters unless it is at the start or the end of the character class or escaped with a backslash.
grouping assertions
A (? combination starts a grouping assertion. For example, (?: means group but do not capture. This means that in the regex /(?:a)/, it will match the string "a", but a is not captured for use in replacement or match groups as it would be from /(a)/.
? is also used for lookahead/lookbehind assertions with ?=, ?!, ?<= and ?<!. Other (? constructs exist as well (inline modifiers, named groups, atomic groups, conditionals); a (? that does not start a recognized construct is a syntax error in PCRE, not a literal ?.
There is no special use for colon : in your case :
(([\w]+:)?\/\/)? will match http://, https://, ftp://...
There is one special use of the colon: every group starting with (?: is non-capturing, so it won't appear in the results.
Example, with "foobarbaz" in input :
/foo((bar)(baz))/ => { [1] => 'barbaz', [2] => 'bar', [3] => 'baz' }
/foo(?:(bar)(baz))/ => { [1] => 'bar', [2] => 'baz' }
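The same comparison can be reproduced in Python, which uses the same (?: syntax. This is only an illustrative sketch, and the slashes need no escaping because Python has no regex delimiters.

```python
import re

capturing = re.match(r"foo((bar)(baz))", "foobarbaz").groups()
non_capturing = re.match(r"foo(?:(bar)(baz))", "foobarbaz").groups()
print(capturing)      # ('barbaz', 'bar', 'baz')
print(non_capturing)  # ('bar', 'baz')

# In the URL pattern the colon is just a literal character after the scheme.
scheme = re.compile(r"(([\w]+:)?//)?")
prefix = scheme.match("https://somewebsite.com").group(0)
print(prefix)  # https://
```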
A colon has no special meaning in Regular Expressions, it just matches a literal colon.
[\w]+:
This just means any word character 1 or more times followed by a literal colon
The brackets are actually not needed here. Square brackets are used to define a set of characters to match. So
[abcd]
means a single character of a, b, c, d
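A short Python sketch shows that the brackets around \w make no difference here. The URL with a port is a made-up example.

```python
import re

url = "https://www.somewebsite.com:8080"  # hypothetical input

with_brackets = re.findall(r"[\w]+:", url)
without_brackets = re.findall(r"\w+:", url)
print(with_brackets)     # ['https:', 'com:']
print(without_brackets)  # ['https:', 'com:']

# A class with several members matches exactly one of them at a time.
letters = re.findall(r"[abcd]", "cabbage")
print(letters)  # ['c', 'a', 'b', 'b', 'a']
```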

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+' but without the '+': this term should not be captured at all
the last term 'hello^-1212121' is captured as 2 groups, 'hello' and '-1212121'; both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to avoid capturing the invalid terms at all, you'll have to use lookarounds. I don't know of any other way to reach the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+)
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal ^
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both true non-negative numbers (i.e. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (i.e. .1, .12)
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
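So, in JavaScript, you'd write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g,
    m;
// exec() advances RE.lastIndex on each call, so the loop visits every match
while ((m = RE.exec(src)) !== null) {
  console.log(m[1]); // first capture group
}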
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1
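A different way to get the same filtering, not taken from the answers above, is to split on commas first and then validate each term as a whole. A minimal Python sketch, with hypothetical names TERM and valid_terms:

```python
import re

# One term: an optionally signed word, or a word with ^ and a positive number.
TERM = re.compile(r"[+-]?[A-Za-z]+|[A-Za-z]+\^(?:\d+(?:\.\d+)?|\.\d+)")

def valid_terms(src):
    # Commas separate the terms; surrounding whitespace is not part of a term.
    terms = (t.strip() for t in src.split(","))
    # fullmatch rejects terms with anything extra, such as "hello+".
    return [t for t in terms if TERM.fullmatch(t)]

print(valid_terms("hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121"))
# ['hello', 'hello^.555', 'hello^111', '-hello', '+hello', 'hello^.25']
```

Validating whole terms avoids the need for lookarounds, at the cost of an extra split step.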