use of colon symbol in regular expression - regex

I am new to regex and am studying it on regularexperssion.com. What is the use of a colon (:) in regular expressions?
For example:
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
which matches:
$url1 = "http://www.somewebsite.com";
$url2 = "https://www.somewebsite.com";
$url3 = "https://somewebsite.com";
$url4 = "www.somewebsite.com";
$url5 = "somewebsite.com";
Yeah, any help would be greatly appreciated.

A colon : is simply a colon. It has no special meaning on its own; it only appears as part of a few special constructs such as clustering without capturing (also known as a non-capturing group):
(?:pattern)
It also appears in POSIX character classes, for example:
[[:upper:]]
However, in your case the colon is just a colon.
Special characters used in your regex:
In character class [-+_~.\d\w]:
- means -
+ means +
_ means _
~ means ~
. means .
\d means any digit
\w means any word character
These symbols have this meaning because they are used inside a character class [].
Outside a character class, + and . have special meanings (+ is a quantifier and . matches any character).
Other elements:
=? means = that can occur 0 or 1 times; in other words = that can occur or not, optional =.
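To see the difference between the two contexts, here is a small sketch. It uses Python's re module, which is an assumption on my part (the question uses PHP), but these particular characters behave the same way in both engines:

```python
import re

# Inside a character class, + and . are ordinary characters.
in_class = re.compile(r"[+.]")
print(bool(in_class.fullmatch("+")))   # True
print(bool(in_class.fullmatch(".")))   # True
print(bool(in_class.fullmatch("a")))   # False

# Outside a class, . matches any character except a newline
# and + repeats the preceding element one or more times.
print(bool(re.fullmatch(r".", "a")))      # True
print(bool(re.fullmatch(r"a+", "aaa")))   # True

# A colon has no special meaning in this position: it matches itself.
scheme = re.compile(r"\w+:")
print(scheme.fullmatch("http:").group())  # http:
```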

I've decided to go you one better and explain the entire regex:
^ # anchor to start of line
( # start grouping
( # start grouping
[\w]+ # at least one of 0-9a-zA-Z_
: # a literal colon
) # end grouping
? # this grouping is optional
\/\/ # two literal slashes
) # end capture
? # this grouping is optional
(
(
[\d\w] # exactly one of 0-9a-zA-Z_
# having \d is redundant
| # alternation
% # literal % sign
[a-fA-f\d]{2,2} # exactly 2 hexadecimal digits
# should probably be A-F
# using {2} would have sufficed
)+ # at least one of these groups
( # start grouping
: # literal colon
(
[\d\w]
|
%
[a-fA-f\d]{2,2}
)+
)? # Same grouping, but it is optional
# and there can be only one
# # literal # sign
)? # this group is optional
(
[\d\w] # same as [\w], explained above
[-\d\w]{0,253} # includes a dash (-) as a valid character
# between 0 and 253 of these characters
[\d\w] # end with \w. They want at most 255
# total and - cannot be at the start
# or end
\. # literal period
)+ # at least one of these groups
[\w]{2,4} # two to four \w characters
(
: # literal colon
[\d]+ # at least one digit
)?
(
\/ # literal slash
(
[-+_~.\d\w] # one of these characters
| # *or*
% # % with two hex digit combo
[a-fA-f\d]{2,2}
)* # zero or more of these groups
)* # zero or more of these groups
(
\? # literal question mark
(
&? # optional literal &
(
[-+_~.\d\w]
|
%
[a-fA-f\d]{2,2}
)
=? # optional literal =
)* # zero or more of this group
)? # this group is optional
(
# # literal #
(
[-+_~.\d\w]
|
%
[a-fA-f\d]{2,2}
)*
)?
$ # anchor to end of line
It's important to understand what the metacharacters/sequences are. Some sequences are not meta when used in certain contexts (especially a character class). I've cataloged them for you:
meta with no context
^ -- zero width start of line
() -- grouping/capture
? -- zero or one of the preceding sequence
+ -- one or more of the preceding sequence
* -- zero or more of the preceding sequence
[] -- character class
\w -- alphanumeric characters and _. Opposite of \W
| -- alternation
{} -- repetition count (quantifier) for the preceding sequence
$ -- zero width end of line
This excludes :, #, and % from having any special/meta meaning in the raw context.
meta inside character class
] ends the character class. - creates a range of characters unless it is at the start or the end of the character class or escaped with a backslash. ^ negates the class when it is the first character after [.
grouping assertions
A (? combination starts a grouping assertion. For example, (?: means group but do not capture. This means that in the regex /(?:a)/, it will match the string "a", but a is not captured for use in replacement or match groups as it would be from /(a)/.
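? is also used for lookahead/lookbehind assertions with (?=, (?!, (?<= and (?<!. Other (? sequences exist as well, such as inline modifiers like (?i) and named groups, and most engines reject an unrecognized (? sequence as a syntax error rather than treating the ? as a literal.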
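As a quick check of the breakdown, here is a sketch that compiles the question's pattern with Python's re module (an assumption, since the question uses PHP) and runs it against the five sample URLs. Two small changes are mine, not the asker's: [a-fA-f\d]{2,2} is written as [a-fA-F\d]{2}, and the pattern is split across lines. The extra URL with a port and fragment is a made-up example.

```python
import re

# The question's pattern, one logical part per line.
URL_RE = re.compile(
    r"^(([\w]+:)?\/\/)?"                                         # scheme and //
    r"(([\d\w]|%[a-fA-F\d]{2})+(:([\d\w]|%[a-fA-F\d]{2})+)?#)?"  # optional part ending in #
    r"([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}"                  # host labels and TLD
    r"(:[\d]+)?"                                                 # port
    r"(\/([-+_~.\d\w]|%[a-fA-F\d]{2})*)*"                        # path segments
    r"(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2})=?)*)?"                  # query string
    r"(#([-+_~.\d\w]|%[a-fA-F\d]{2})*)?$"                        # fragment
)

urls = [
    "http://www.somewebsite.com",
    "https://www.somewebsite.com",
    "https://somewebsite.com",
    "www.somewebsite.com",
    "somewebsite.com",
]
for url in urls:
    print(url, bool(URL_RE.match(url)))   # True for all five

# Every colon in the pattern is a literal: one ends the scheme, one starts the port.
m = URL_RE.match("http://example.com:8080/a/b?x=1#frag")
print(m.group(2), m.group(8))             # http: :8080
```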

There is no special use for the colon : in your case:
(([\w]+:)?\/\/)? will match http://, https://, ftp://...
There is one special use of the colon: a group that starts with (?: is non-capturing, so it won't appear in the results.
Example, with "foobarbaz" as input:
/foo((bar)(baz))/ => { [1] => 'barbaz', [2] => 'bar', [3] => 'baz' }
/foo(?:(bar)(baz))/ => { [1] => 'bar', [2] => 'baz' }
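The same comparison can be reproduced in Python (my choice of language for illustration; group numbering works the same way in PHP):

```python
import re

text = "foobarbaz"

# Three capturing groups: the outer one plus the two inner ones.
capturing = re.match(r"foo((bar)(baz))", text)
print(capturing.groups())       # ('barbaz', 'bar', 'baz')

# The outer group is now non-capturing, so only two groups remain.
non_capturing = re.match(r"foo(?:(bar)(baz))", text)
print(non_capturing.groups())   # ('bar', 'baz')

# The overall match is unchanged either way.
print(non_capturing.group(0))   # foobarbaz
```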

A colon has no special meaning in Regular Expressions, it just matches a literal colon.
[\w]+:
This just means any word character 1 or more times followed by a literal colon
The square brackets are actually not needed here, since \w is already a single character class. Square brackets are used to define a set of characters to match. So
[abcd]
means a single character of a, b, c, d

Related

Regex catching adjacent characters with a single character set

I am trying to construct a regex statement that matches a string conforming to the following conditions:
3-63 lowercase alphanumeric characters, plus "." and "-"
May not start or end with . or -
Dashes and periods cannot be adjacent to each other.
abc-123.xyz <- should match
abc123-.xyz <- should not match
I have been able to put this regex together, but it does not catch the third requirement. I've tried to use another negative lookahead/lookbehind, i.e. (?!.-|-.), but it's still matching the strings with adjacent periods and dashes. Here's the regex statement I came up with that fulfills conditions 1 & 2:
^(?!\.|-)([a-z0-9]|\.|-){3,63}(?<!\.|-)$
FYI, this regex is for validating input when specifying an AWS S3 bucket name in a CloudFormation template.
How about:
^(?=.{3,63}$)[a-z0-9]+(?:[-.][a-z0-9]+)*$
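Use this pattern: ^(?!.*[.-](?=[.-]))[^.-][a-z0-9.-]{1,61}[^.-]$ Note that [^.-] accepts any character other than . and -, including uppercase letters, so use [a-z0-9] in its place if the first and last characters must be lowercase alphanumeric.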
# ^(?!.*[.-](?=[.-]))[^.-][a-z0-9.-]{1,61}[^.-]$
^ # Start of string/line
(?! # Negative Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
[.-] # Character in [.-] Character Class
(?= # Look-Ahead
[.-] # Character in [.-] Character Class
) # End of Look-Ahead
) # End of Negative Look-Ahead
[^.-] # Character not in [.-] Character Class
[a-z0-9.-] # Character in [a-z0-9.-] Character Class
{1,61} # (repeated {1,61} times)
[^.-] # Character not in [.-] Character Class
$ # End of string/line
^[a-z0-9](?:[a-z0-9]|[.\-](?=[a-z0-9])){2,62}$
We match a lowercase alphanumeric character, followed by between 2 and 62 repetitions of either:
a lowercase alphanumeric character, or
a . or - (which must be followed by a lowercase alphanumeric character).
The last restriction makes sure that you can't have two ./- characters in a row, or a ./- at the end of the string.
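To compare the three suggested patterns side by side, here is a rough test harness in Python (my choice; the question does not name a language). The labels in PATTERNS and the helper accepted_by are hypothetical names introduced only for this sketch.

```python
import re

# The three patterns suggested above, under made-up labels.
PATTERNS = {
    "lookahead_length": r"^(?=.{3,63}$)[a-z0-9]+(?:[-.][a-z0-9]+)*$",
    "negated_ends": r"^(?!.*[.-](?=[.-]))[^.-][a-z0-9.-]{1,61}[^.-]$",
    "guarded_separator": r"^[a-z0-9](?:[a-z0-9]|[.\-](?=[a-z0-9])){2,62}$",
}

def accepted_by(name):
    """Return the labels of the patterns that accept `name`."""
    return {label for label, pattern in PATTERNS.items() if re.search(pattern, name)}

samples = ["abc-123.xyz", "abc123-.xyz", "ab", ".abc", "abc-", "a" * 63, "a" * 64, "Abc"]
for sample in samples:
    print(repr(sample), sorted(accepted_by(sample)))
```

All three agree on the samples from the question and on the length limits. They differ on "Abc": only the pattern with [^.-] at both ends accepts it, because a negated class admits anything except the listed characters.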

Grails/Groovy regular expression- how to use (?i) to make everything case insensitive?

I use the following RegEx:
url (blank:false, matches: /^(https?:\/\/)(?:[A-Za-z0-9]+([\-\.][A-Za-z0-9]+)*\.)+[A-Za-z]{2,40}(:[1-9][0-9]{0,4})?(\/\S*)?/)
I want to add (?i) to make everything case insensitive. How should I add this?
I can confirm the (?i) at the beginning of the regex makes it case insensitive.
Anyway, if your purpose is to reduce the regex length you can use the groovy dollar slashy string form. It allows you to not escape slashes / (the escape char becomes $).
In addition:
the POSIX class \p{Alnum} is the compact equivalent of [0-9a-zA-Z] (this way you can avoid using (?i) for the host name).
remove the unneeded backslashes from the character class [\-\.] -> [-.] (the dash needs no escape when it is the first or the last element, and the dot is always literal inside a character class).
remove the unneeded parentheses from the protocol section
In the following version I take advantage of the multiline support of dollar slashy string and the free-spacing regex flag (?x):
$/(?x)
^ # start of the string
https?:// # http:// or https://, no need of round brackets
( # start group 1; it could be non-capturing (?: ... ), but that is less readable
\p{Alnum}+ # one or more alphanumeric char instead of [a-zA-Z0-9]
([.-]\p{Alnum}+)* # zero or more of (literal dot or dash followed by one or more [a-zA-Z0-9])
\. # a literal dot
)+ # repeat the group 1 one or more
\p{Alpha}{2,40} # between 2 and 40 alphabetic chars [a-zA-Z]
(:[1-9][0-9]{0,4})? # [optional] a literal colon ':' followed by a non-zero digit and up to four more digits
(/\S*)? # [optional] a literal slash '/' followed by zero or more non-space chars
/$
A dollar-slashy compact version:
$/^https?://(\p{Alnum}+([.-]\p{Alnum}+)*\.)+\p{Alpha}{2,40}(:[1-9][0-9]{0,4})?(/\S*)?/$
If you must use the slashy version this is an equivalent:
/^https?:\/\/(?:\p{Alnum}+([.-]\p{Alnum}+)*\.)+\p{Alpha}{2,40}(:[1-9][0-9]{0,4})?(\/\S*)?/
A snippet of code to test all these regex:
def multiline_pattern = $/(?x)
^ # start of the string
https?:// # http:// or https://, no need of round bracket
( # start group 1; it could be non-capturing (?: ... ), but that is less readable
\p{Alnum}+ # one or more alphanumeric char, instead of [a-zA-Z0-9]
([.-]\p{Alnum}+)* # zero or more of (literal dot or dash followed by one or more [0-9a-zA-Z])
\. # a literal dot
)+ # repeat the group 1 one or more
\p{Alpha}{2,40} # between 2 and 40 alphabetic chars [a-zA-Z]
(:[1-9][0-9]{0,4})? # [optional] a literal colon ':' followed by a non-zero digit and up to four more digits
(/\S*)? # [optional] a literal slash '/' followed by zero or more non-space chars
/$
def compact_pattern = $/^https?://(\p{Alnum}+([.-]\p{Alnum}+)*\.)+\p{Alpha}{2,40}(:[1-9][0-9]{0,4})?(/\S*)?/$
def slashy_pattern = /^https?:\/\/(?:\p{Alnum}+([.-]\p{Alnum}+)*\.)+\p{Alpha}{2,40}(:[1-9][0-9]{0,4})?(\/\S*)?/
def url1 = 'https://www.example-test.domain.com:12344/aloha/index.html'
def notUrl1 = 'htxps://www.example-test.domain.com:12344/aloha/index.html'
def notUrl2 = 'https://www.example-test.domain.com:02344/aloha/index.html'
assert url1 ==~ multiline_pattern
assert url1 ==~ compact_pattern
assert url1 ==~ slashy_pattern
assert !( notUrl1 ==~ multiline_pattern )
assert !( notUrl1 ==~ compact_pattern )
assert !( notUrl1 ==~ slashy_pattern )
assert !( notUrl2 ==~ multiline_pattern )
assert !( notUrl2 ==~ compact_pattern )
assert !( notUrl2 ==~ slashy_pattern )
You place them in the regexp - like in java:
groovy:000> "http://example.COM" ==~ /^(https?:\/\/)(?:[a-z0-9]+([\-\.][a-z0-9]+)*\.)+[a-z]{2,40}(:[1-9][0-9]{0,4})?(\/\S*)?/
===> false
groovy:000> "http://example.COM" ==~ /^(?i)(https?:\/\/)(?:[a-z0-9]+([\-\.][a-z0-9]+)*\.)+[a-z]{2,40}(:[1-9][0-9]{0,4})?(\/\S*)?/
===> true
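For comparison, here is how the same idea looks in Python (an illustration only; the question is about Groovy). One difference worth knowing: Java and Groovy accept (?i) in the middle of a pattern, as in the example above where it follows ^, whereas recent Python versions require a global inline flag at the very start. The pattern below is a lightly adapted copy of the question's regex with non-capturing groups.

```python
import re

URL = r"https?://(?:[a-z0-9]+(?:[-.][a-z0-9]+)*\.)+[a-z]{2,40}(?::[1-9][0-9]{0,4})?(?:/\S*)?"

case_sensitive = re.compile(URL)
inline_flag = re.compile("(?i)" + URL)          # flag written into the pattern itself
flag_argument = re.compile(URL, re.IGNORECASE)  # same effect through the API

print(case_sensitive.fullmatch("http://example.COM"))        # None
print(bool(inline_flag.fullmatch("http://example.COM")))     # True
print(bool(flag_argument.fullmatch("HTTP://Example.com")))   # True
```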

Exact string coldfusion regular expression

I am using a regular expression to replace all characters that are not equal to the exact word "NULL" and also keep all digits. I did a first step, by replacing all "NULL" words from my string with this :
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","\bNULL\b","","ALL")>
It removes all instances of the exact "NULL" word, which means it does not remove the letters "N", "U" and "L" from the substring "123NjyfjUghfLL". And this is correct. But now, I want to reverse that. I want to keep only the "NULL" word, meaning that it removes the single "N", "U" and "L" letters. So I tried that :
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","[^\bNULL\b]","","ALL")>
But now this keeps all "N", "U" and "L" letters, so it outputs "NULLNULLNULLNULL". There should be only 3 times "NULL".
Can someone help me with this please? And where to add the extra code to keep digits? Thank you.
You can do this
<cfset data = ReReplaceNoCase("123NjyfjUghfLL|NULL|NULL|NULL","(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)","\1","ALL")>
(^|\|)(?!NULL(?:$|\|))([^|]*)(?=$|\|)
Explanation:
( # Opens Capture Group 1
^ # Anchors to the beginning of the string.
| # Alternation (CG1)
\| # Literal |
) # Closes CG1
(?! # Opens Negative Lookahead
NULL # Literal NULL
(?: # Opens Non-Capturing group
$ # Anchors to the end of the string.
| # Alternation (NCG)
\| # Literal |
) # Closes NCG
) # Closes NLA
( # Opens Capture Group 2
[^|]* # Negated Character class (excludes the characters within)
# None of: |
# * repeats zero or more times
) # Closes CG2
(?= # Opens LA
$ # Anchors to the end of the string.
| # Alternation (LA)
\| # Literal |
) # Closes LA
Lastly, some insight about character classes (content between square brackets)
What [^\bNULL\b] means is
[^\bNULL\b] # Negated Character class (excludes the characters within)
# None of: \b,N,U,L
# When \b is inside a character class, it matches a backspace character.
# Outside of a character class, \b matches a word boundary as you use it in your first code.
Character classes are not designed for matching or ignoring words, they're designed for permitting or excluding characters or ranges of characters.
Edit:
Ok so it works well. But what if I would like to keep also the digits? I am a kind of lost in this line of code and I cannot find where to put extra code... I think the extra code would be [^0-9] right?
This regex also permits numbers of any length where the number is the entire value
(^|\|)(?!(?:NULL|[0-9]+)(?:$|\|))([^|]*)(?=$|\|)
You can also use this regex to permit numbers with a decimal value.
(^|\|)(?!(?:NULL|[0-9]+(?:\.[0-9]+)?)(?:$|\|))([^|]*)(?=$|\|)
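The behaviour of these patterns can be checked outside ColdFusion too. Below is a sketch in Python (my assumption for illustration; re.sub with re.IGNORECASE stands in for ReReplaceNoCase with "ALL"). The helper blank_unwanted and the second sample string are hypothetical, introduced only for this example.

```python
import re

data = "123NjyfjUghfLL|NULL|NULL|NULL"

# A character class lists single characters, so [^\bNULL\b] strips everything
# that is not a backspace, N, U or L.
print(re.sub(r"[^\bNULL\b]", "", data))   # NULLNULLNULLNULL

def blank_unwanted(value, keep):
    """Blank every |-delimited field that is not entirely `keep` (a regex fragment)."""
    pattern = r"(^|\|)(?!(?:" + keep + r")(?:$|\|))([^|]*)(?=$|\|)"
    return re.sub(pattern, r"\1", value, flags=re.IGNORECASE)

mixed = "abc|123|NULL|x1|45.6"
print(blank_unwanted(data, r"NULL"))                        # |NULL|NULL|NULL
print(blank_unwanted(mixed, r"NULL|[0-9]+"))                # |123|NULL||
print(blank_unwanted(mixed, r"NULL|[0-9]+(?:\.[0-9]+)?"))   # |123|NULL||45.6
```

Note that the delimiters are kept, so blanked fields show up as empty positions between the | characters.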

RegEx to replace prefix and postfix

I would like to build a RegEx expression to replace the prefix and postfix of a string. the general string is built from
a known prefix string
some letter a-z or A-Z
some unknown string with letters, hyphens, backslash, slash and numbers.
a hyphen
an integer number
the symbols #.
some string of letters
Examples:
KnownStringr/df-2e\d-3724#.Gkjsu
KnownStringEd\e4v-bn-824#.YKfg
KnownStringa-YK224E\yy-379924#.awws
I would like to replace the prefix and postfix of the NUMBER so that I get:
MyPrefix3724MyPostfix
MyPrefix824MyPostfix
MyPrefix379924MyPostfix
This regex should do the trick, but you should always specify the language/framework you're using, because not all regex engines support the same features.
The number that you want to capture would be in capture group #3 ((\d+)), which most languages reference as \3
(?:KnownString)([a-zA-Z])(.*?)-(\d+)\#\.[a-zA-Z]+
Explanation:
(?: # Opens NCG
KnownString # Literal KnownString
) # Closes NCG
( # Opens CG1
[a-zA-Z] # Character class (any of the characters within)
# Anything between a and z
# Anything between A and Z
) # Closes CG1
( # Opens CG2
.*? # . denotes any single character, except for newline
# * repeats zero or more times
# ? as few times as possible
)- # Closes CG2
# Literal -
( # Opens CG3
\d+ # Token: \d (digit)
# + repeats one or more times
) # Closes CG3
\# # Literal #
\. # Literal .
[a-zA-Z]+ # Character class (any of the characters within)
# Anything between a and z
# Anything between A and Z
# + repeats one or more times
You haven't specified what the known prefix is, so be careful to escape any special characters in the known string, especially period, plus sign, asterisk, question mark, and parentheses.
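Since the question does not name a language, here is one possible way to apply the pattern, sketched in Python. The function name rewrap and its parameters are made up for this example; re.escape takes care of any special characters in the known prefix.

```python
import re

def rewrap(text, known_prefix, new_prefix="MyPrefix", new_postfix="MyPostfix"):
    """Keep only the number and wrap it in the new prefix and postfix."""
    pattern = re.compile(
        re.escape(known_prefix)              # treat the known prefix literally
        + r"[a-zA-Z].*?-(\d+)#\.[a-zA-Z]+"   # letter, anything, hyphen, number, "#.", letters
    )
    # A function replacement avoids any escaping issues in the new prefix/postfix.
    return pattern.sub(lambda m: new_prefix + m.group(1) + new_postfix, text)

samples = [
    r"KnownStringr/df-2e\d-3724#.Gkjsu",
    r"KnownStringEd\e4v-bn-824#.YKfg",
    r"KnownStringa-YK224E\yy-379924#.awws",
]
for sample in samples:
    print(rewrap(sample, "KnownString"))
# MyPrefix3724MyPostfix
# MyPrefix824MyPostfix
# MyPrefix379924MyPostfix
```

Only the number needs a capture group here, so it is group 1 in this sketch rather than group 3.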

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+' but without the '+' : this group should not be captured at all
the last term 'hello^-1212121' as 2 groups 'hello' and '-1212121' both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to exclude the terms that should not be captured at all, you'll have to use lookarounds. I don't know of any other way to check the delimiters without consuming them and hindering the following matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+)
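One thing to watch once the lookarounds are gone is the order of the alternatives: the engine takes the first alternative that matches, so the longer word^number form has to be tried before the plain word. A small Python sketch (Python is my assumption here; the .NET engine orders alternation the same way) with made-up variable names:

```python
import re

# Same two alternatives, in both orders.
short_first = re.compile(r"[-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+")
long_first = re.compile(r"[a-zA-Z]+\^(?:\d+)?\.?\d+|[-+]?[a-zA-Z]+")

terms = "hello, hello^.555, -hello, +hello, hello^9.8"

# The plain-word alternative wins first, so the exponent is never reached.
print(short_first.findall(terms))
# ['hello', 'hello', '-hello', '+hello', 'hello']

# Trying the exponent form first keeps the whole term.
print(long_first.findall(terms))
# ['hello', 'hello^.555', '-hello', '+hello', 'hello^9.8']
```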
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal caret
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal caret
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures true non-negative numbers (e.g. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (e.g. .1, .12).
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+.\d+, or .\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
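So, in JavaScript, you'd write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g,
    match;
// exec() with the g flag resumes from the previous match on each call
while ((match = RE.exec(src)) !== null) {
  console.log(match[1]); // first capture group
}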
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1